Add dataset: royal_society_corpus #41

davanstrien · 2022-07-11T15:11:17Z

A URL for this dataset

https://fedora.clarin-d.uni-saarland.de/rsc/

Dataset description

The Royal Society Corpus (RSC) is based on the first two centuries of the Philosophical Transactions of the Royal Society of London from its beginning in 1665 to 1869. It includes all publications of the journal written mainly in English and containing running text. The Philosophical Transactions was the first periodical of scientific writing in England. Founded in 1665 by Henry Oldenburg, the first secretary of the Royal Society, it initially contained excerpts of letters of his scientific correspondence, reviews and summaries of recently-published books, and accounts of observations and experiments.

This offers an interesting dataset of text from the scientific domain across a long time period (1665-1869). Additionally, the dataset contains a range of annotations:

The corpus is tokenized and linguistically annotated for lemma and part-of-speech using TreeTagger (Schmid 1994, Schmid 1995). For spelling normalization we use a trained model of VARD (Baron and Rayson 2008). As a special feature, we encode with each unit (word token) its average surprisal, i.e. the average amount of information it encodes in number of bits, with words as units and trigrams as contexts (cf. Genzel and Charniak 2002).
Detailed information on the linguistic and structural annotation of the RSC can be found here.

The RSC consists of approximately 35 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decades and 50-year periods is also available, allowing for diachronic analysis at different granularities. Token sizes of the different subcorpora and other corpus statistics can be found here.

Dataset modality

Text

Dataset licence

Creative Commons Attribution Non Commercial Share Alike 4.0 International

Other licence

No response

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

[email protected]

davanstrien · 2022-07-11T15:11:59Z

Some additional information on the annotation format here https://fedora.clarin-d.uni-saarland.de/rsc/annotation.html

shamikbose · 2022-07-12T15:46:40Z

#self-assign

shamikbose · 2022-07-12T23:11:50Z

@davanstrien, there seems to be a problem with the data:

</page>
<page no="page_0002" id="1070">
hands	NNS	hand	hands	2.09	0.65	1.12	1.50
.	SENT	.	.	2.39	0.29	3.09	4.43
</s>

The <s> tags are supposed to be inside <page> tags, but as you can see above, the last line is a closing <s> tag without an opening one. As a result, both `xml` and `bs4` fail to parse it properly.

I changed it to look like this:

our	PP$	our	our	6.09	0.40	5.53	4.47
hands	NNS	hand	hands	2.09	0.65	1.12	1.50
.	SENT	.	.	2.39	0.29	3.09	4.43
</s>
</page>
<page no="page_0002" id="1070">

and then it parses that part correctly

davanstrien · 2022-07-13T09:35:19Z

I'll take a quick look at this today. One option might be to use a slightly cruder approach to parsing. I'll play around a bit and let you know how I get on with that.
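A cruder, line-based approach of the kind mentioned above might look like the following sketch. It is not the parser that was eventually used; it only assumes that token rows are tab-separated, that tags sit alone on their own lines, and that a closing `</s>` marks the end of a sentence. The names `parse_sentences` and `is_tag_line` are hypothetical, and the first `<page>` line in the sample (including its id "1069") is made up for illustration.

```python
import re

# Matches an opening <page ...> tag and captures its id attribute.
PAGE_OPEN = re.compile(r'<page\b[^>]*\bid="([^"]*)"')


def is_tag_line(line):
    """Treat a line as markup if it has no tab and looks like <...>."""
    stripped = line.strip()
    return "\t" not in line and stripped.startswith("<") and stripped.endswith(">")


def parse_sentences(lines):
    """Yield (page_id, rows) for each sentence, ignoring tag nesting.

    A sentence ends at a closing </s> line; <page> tags only update the
    current page id. The page id reported is the one in effect when the
    sentence's first token was read.
    """
    page_id = None
    sentence_page = None
    rows = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if is_tag_line(line):
            match = PAGE_OPEN.match(line.strip())
            if match:
                page_id = match.group(1)
            elif line.strip() == "</s>" and rows:
                yield sentence_page, rows
                rows = []
            continue
        if not rows:
            sentence_page = page_id
        rows.append(line.split("\t"))
    if rows:  # trailing tokens with no closing </s>
        yield sentence_page, rows


# Sample modelled on the excerpt in this thread; the first <page> line and
# its id "1069" are assumptions, not taken from the corpus.
sample = [
    '<page no="page_0001" id="1069">',
    "<s>",
    "our\tPP$\tour\tour\t6.09\t0.40\t5.53\t4.47",
    "</page>",
    '<page no="page_0002" id="1070">',
    "hands\tNNS\thand\thands\t2.09\t0.65\t1.12\t1.50",
    ".\tSENT\t.\t.\t2.39\t0.29\t3.09\t4.43",
    "</s>",
    "</page>",
]
sentences = list(parse_sentences(sample))
```

Because nesting is never checked, a sentence that straddles a page break is kept whole and attributed to the page on which it starts. Whether that attribution is the right choice for the dataset is a judgment call.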

shamikbose · 2022-07-13T11:34:01Z

I was considering using a stack, but in the case of malformed data, the sentences would be wrong.
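A stack can still be useful for locating the malformed spots, even if it cannot repair the sentences. The sketch below is a minimal illustration under the assumption that every tag sits alone on its own line; `check_nesting` is a hypothetical helper, not part of any library. It reports closing tags that do not match the most recently opened tag, plus any tags left open at the end.

```python
import re

# An opening or closing tag alone on a line, e.g. <s>, </s>, <page no="...">.
TAG_LINE = re.compile(r"^<(/?)([A-Za-z][\w.-]*)[^>]*>$")


def check_nesting(lines):
    """Return (unexpected_closings, unclosed_openings) as (line_no, tag) lists."""
    stack = []       # (line_no, tag) for tags that are currently open
    unexpected = []  # closing tags that do not match the top of the stack
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        match = TAG_LINE.match(line)
        if match is None or line.endswith("/>"):
            continue  # token row, blank line, or self-closing tag
        is_closing, name = match.group(1) == "/", match.group(2)
        if not is_closing:
            stack.append((line_no, name))
        elif stack and stack[-1][1] == name:
            stack.pop()
        else:
            unexpected.append((line_no, name))
    return unexpected, stack


# The excerpt quoted earlier in this thread, one string per line.
excerpt = [
    "</page>",
    '<page no="page_0002" id="1070">',
    "hands\tNNS\thand\thands\t2.09\t0.65\t1.12\t1.50",
    ".\tSENT\t.\t.\t2.39\t0.29\t3.09\t4.43",
    "</s>",
]
unexpected, unclosed = check_nesting(excerpt)
```

On this excerpt the `</page>` on line 1 is flagged only because the excerpt starts mid-file; the `</s>` on line 5 is the real symptom, since it closes a sentence while a `<page>` is the innermost open tag.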

davanstrien added the candidate-dataset label Jul 11, 2022

davanstrien added the dataset label and removed the candidate-dataset label Jul 11, 2022

bigscience-workshop-projects bot moved this to Todo in BigLAM: BigScience Libraries, Archives and Museums Jul 11, 2022

bigscience-workshop-projects bot added this to BigLAM: BigScience Libraries, Archives and Museums Jul 11, 2022

github-actions bot assigned shamikbose Jul 12, 2022
