
Add dataset: distantreader #80

Open

davanstrien opened this issue Aug 5, 2022 · 2 comments
Labels
candidate-dataset Proposed dataset to be added


@davanstrien (Collaborator)

A URL for this dataset

http://library.distantreader.org/

Dataset description

This is a fledgling collection of data sets created by the Distant Reader -- a library.

The result of the Distant Reader process is the creation of data sets called "study carrels". The student, researcher, or scholar can then use a study carrel to read the content of a carrel both closely and at a distance. The purpose of the library is to illustrate and demonstrate the breadth & depth of what can be created with the Reader. Presently, there are about 2,000 items in the collection.

Dataset modality

Text

Dataset licence

No response

Other licence

No response

How can you access this data

Other

Size of dataset

No response

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

@davanstrien davanstrien added the candidate-dataset Proposed dataset to be added label Aug 5, 2022
@davanstrien (Collaborator, Author)

@ericleasemorgan I thought we could use this issue to discuss further the best approach for this dataset :)

@davanstrien (Collaborator, Author)

I'll try to take a closer look at this again next week, but some initial thoughts below:

Just to reiterate, the next step is for me to write a little Python script with a specific name and in a specific location. The script is really a set of classes and objects denoting features of my data sets. I then run another script that looks for the script I just wrote, and it will output data files of some sort, and those data files will go to 🌸? Finally, somebody will then be able to run something like the above (ds = load_dataset('load_distant_reader', name='homer')) to actually load my data sets. Correct?

This would be one way of managing this. I think it depends a little bit on how you think a user might be likely to interact with this dataset. Since you already support quite a few ways of working with the existing collections using your library, it probably doesn't make sense to recreate this behaviour.
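
For concreteness, here's a minimal sketch of what such a loading script could look like. The script name, class name, catalogue URL, and feature schema below are all assumptions for illustration, not the actual Distant Reader layout:

```python
# load_distant_reader.py -- illustrative skeleton of a `datasets` loading script.
# The catalogue URL and feature names are placeholders, not confirmed endpoints.
import datasets

_CATALOGUE_URL = "http://library.distantreader.org/catalog.json"  # assumed location


class DistantReader(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="Study carrels from the Distant Reader library.",
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "text": datasets.Value("string"),
                }
            ),
            homepage="http://library.distantreader.org/",
        )

    def _split_generators(self, dl_manager):
        catalogue_path = dl_manager.download(_CATALOGUE_URL)
        # Everything lands in one split; see the note on splits further down.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"catalogue_path": catalogue_path},
            )
        ]

    def _generate_examples(self, catalogue_path):
        # Real logic would walk the catalogue and yield one example per text.
        raise NotImplementedError
```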

It may be interesting to create a loading script that loads all of the texts in the catalogue (possibly allowing the user to specify some filters defining what is collected from the catalogue).

This could work by starting from the JSON catalogue and then parsing each item's manifest for the things that might be useful to include for ML research.
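
As a rough sketch of that catalogue-first approach (the catalogue URL, its JSON shape, and the field names are guesses on my part, not the confirmed API):

```python
# Illustrative only: assumes the catalogue is a JSON list of items, each of
# which points at a per-carrel manifest listing its plain-text files.
import json
import urllib.request

CATALOGUE_URL = "http://library.distantreader.org/catalog.json"  # assumed


def iter_carrels(filter_fn=None):
    """Yield catalogue entries, optionally filtered (e.g. by subject or date)."""
    with urllib.request.urlopen(CATALOGUE_URL) as response:
        catalogue = json.load(response)
    for item in catalogue:
        if filter_fn is None or filter_fn(item):
            yield item


# e.g. keep only carrels whose (assumed) "title" field mentions Homer:
homer_carrels = list(iter_carrels(lambda i: "homer" in i.get("title", "").lower()))
```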

If so, then many of the examples in [1] allude to training sets, splits, and testing. To what degree is this required for me to denote? All of my data (plain text) files are saved in a single directory, and they are not divided into training and testing sets.

Since the data would most likely not be used for supervised training, it is okay to put everything into a single training split. This is often how other datasets for language modelling end up looking.
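
In practice, the user-facing side would then be as simple as this (the script name is carried over from the hypothetical sketch above):

```python
from datasets import load_dataset

# Hypothetical: everything lands in one "train" split, which is fine for
# language-modelling-style use where no held-out test set exists upstream.
ds = load_dataset("load_distant_reader")
print(ds)                    # DatasetDict with a single "train" split
texts = ds["train"]["text"]  # all carrel texts, no train/test division
```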
