Add dataset: clmet_3-1 #58

clancyoftheoverflow · 2022-07-14T16:01:07Z

A URL for this dataset

http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html

Dataset description

The Corpus of Late Modern English Texts, version 3.1 (CLMET3.1) is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. It incorporates CLMET, CLMETEV, and CLMET3.0, and has been compiled following roughly the same principles, that is:

The corpus covers the period 1710–1920, divided into three 70-year sub-periods.
The texts making up the corpus have all been written by British and Irish authors who are native speakers of English.
The corpus never contains more than three texts by the same author.
The texts within each sub-period have been written by authors born within a correspondingly restricted sub-period.
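
For anyone bucketing texts by date, the sub-period scheme described above can be sketched as follows. This is only an illustration: the description gives the overall span (1710–1920) and the 70-year width, so the boundaries below come from simple arithmetic. The function names, the half-open interval convention, and the treatment of the final year are assumptions, not something taken from the corpus documentation.

```python
from typing import List, Optional, Tuple

# Values taken from the dataset description above.
CORPUS_START = 1710
CORPUS_END = 1920
N_SUBPERIODS = 3


def subperiods(start: int = CORPUS_START,
               end: int = CORPUS_END,
               n: int = N_SUBPERIODS) -> List[Tuple[int, int]]:
    """Split [start, end] into n equal-width (lower, upper) year ranges."""
    width = (end - start) // n  # 70 years for the values above
    return [(start + i * width, start + (i + 1) * width) for i in range(n)]


def subperiod_index(year: int) -> Optional[int]:
    """Return the 0-based sub-period for a year, or None if out of range.

    Assumption: ranges are half-open [lower, upper), except that the
    final year of the corpus is counted in the last sub-period.
    """
    for i, (lower, upper) in enumerate(subperiods()):
        if lower <= year < upper:
            return i
    if year == CORPUS_END:
        return N_SUBPERIODS - 1
    return None


if __name__ == "__main__":
    print(subperiods())
    print(subperiod_index(1800))
```

How the corpus itself assigns a text dated exactly on a boundary year (for example 1780) is not stated in the description, so that choice should be checked against the corpus documentation before relying on it.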

Size: 34 million words

Annotation: PoS-tagged; genre.

Dataset modality

Text

Dataset licence

Creative Commons Attribution Non Commercial Share Alike 4.0 International

Other licence

No response

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

davanstrien · 2022-07-14T16:23:07Z

This looks amazing!

shamikbose · 2022-07-16T22:13:18Z

#self-assign

shamikbose · 2022-07-17T23:31:53Z

#ready-for-review

shamikbose · 2022-07-17T23:32:16Z

@clancyoftheoverflow @davanstrien This dataset is live at https://huggingface.co/datasets/shamikbose89/clmet_3_1

davanstrien · 2022-07-18T14:30:22Z

@shamikbose thanks, I'll aim to review this today or tomorrow. @clancyoftheoverflow you probably know this dataset better than me, so feel free to also review it.

clancyoftheoverflow added the candidate-dataset label (Proposed dataset to be added) Jul 14, 2022

davanstrien added the dataset label (Dataset to be added) and removed the candidate-dataset label (Proposed dataset to be added) Jul 14, 2022

bigscience-workshop-projects bot added this to BigLAM: BigScience Libraries, Archives and Museums Jul 14, 2022

bigscience-workshop-projects bot moved this to Todo in BigLAM: BigScience Libraries, Archives and Museums Jul 14, 2022

github-actions bot assigned shamikbose Jul 16, 2022

github-actions bot added the ready for review label (Issue ready to be reviewed by maintainers) Jul 17, 2022

davanstrien self-assigned this Jul 18, 2022
