A URL for this dataset
http://library.distantreader.org/
Dataset description
This is a fledgling collection of data sets -- a library -- created by the Distant Reader.
The result of the Distant Reader process is the creation of data sets called "study carrels". The student, researcher, or scholar can then use a study carrel to read its content both closely and at a distance. The purpose of the library is to illustrate and demonstrate the breadth & depth of what can be created with the Reader. Presently, there are about 2,000 items in the collection.
Dataset modality
Text
Dataset licence
No response
Other licence
No response
How can you access this data
Other
Size of dataset
No response
Confirm the dataset has an open licence
To the best of my knowledge, this dataset is accessible via an open licence
Contact details for data custodian
No response
I'll try to take a closer look at this again next week, but here are some initial thoughts:
Just to reiterate, the next step is for me to write a little Python script with a specific name and in a specific location. The script is really a set of classes and objects denoting features of my data sets. I then run another script that looks for the script I just wrote, and it will output data files of some sort, and those data files will go to 🌸? Finally, somebody will then be able to run something like `ds = load_dataset('load_distant_reader', name='homer')` to actually load my data sets. Correct?
This would be one way of managing this. I think it depends a little bit on how you think a user might be likely to interact with this dataset. Since you already support quite a few ways of working with the existing collections using your library, it probably doesn't make sense to recreate this behaviour.
It may be interesting to create a loading script that loads all of the texts in the catalogue (possibly allowing the user to specify some filters for defining what is collected from the catalogue).
This could work by starting from the JSON catalogue and then parsing the manifest for each item to pick out the things that might be useful to include for ML research.
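For what it's worth, a minimal loading script along those lines might look something like the sketch below. It assumes the catalogue is a JSON file listing items, each with an identifier and a plain-text URL; the catalogue URL and the field names (`items`, `id`, `txt_url`) are placeholders, not the real Distant Reader schema.

```python
# A minimal sketch of a datasets loading script, built on GeneratorBasedBuilder.
# The catalogue URL and JSON field names below are assumptions for illustration.
import json

import datasets

_CATALOGUE_URL = "http://library.distantreader.org/catalog/catalog.json"  # hypothetical path


class DistantReader(datasets.GeneratorBasedBuilder):
    """Plain-text study carrels from the Distant Reader library."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description="Study carrels created by the Distant Reader.",
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "text": datasets.Value("string"),
                }
            ),
            homepage="http://library.distantreader.org/",
        )

    def _split_generators(self, dl_manager):
        # Fetch the JSON catalogue, then download every text it lists.
        catalogue_path = dl_manager.download(_CATALOGUE_URL)
        with open(catalogue_path, encoding="utf-8") as f:
            catalogue = json.load(f)
        urls = {item["id"]: item["txt_url"] for item in catalogue["items"]}
        paths = dl_manager.download(urls)
        # Everything goes into a single "train" split; see the note below.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"paths": paths},
            )
        ]

    def _generate_examples(self, paths):
        for item_id, path in paths.items():
            with open(path, encoding="utf-8") as f:
                yield item_id, {"id": item_id, "text": f.read()}
```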
If so, then many of the examples in [1] allude to training sets, splits, and testing. To what degree am I required to denote these? All of my data (plain text) files are saved in a single directory, and they are not divided into training and testing sets.
Since the data would most likely not be used for supervised training, it is okay to put everything into a single training split. This is often how other datasets for language modelling end up looking.
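Assuming a loading script like the sketch above, usage would then look roughly like this (the dataset name `distant_reader` is an assumption):

```python
from datasets import load_dataset

ds = load_dataset("distant_reader")       # hypothetical name for the script above
print(ds)                                 # DatasetDict with a single "train" split
print(ds["train"][0]["text"][:200])       # first 200 characters of one document
```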