bloemj/Wiki-rnd

Wiki-rnd

Wiki-rnd dataset for small data distributional semantic evaluation (without gold standard), used in the paper "Evaluating the consistency of word embeddings from small data". This dataset contains 300 randomly sampled one-word Wikipedia page titles, the same ones used as a test set by Herbelot & Baroni (2017, "High-risk learning: acquiring new word vectors from tiny data"), and non-overlapping random samples of sentences from a 140M word preprocessed Wikipedia snapshot containing those terms.

The sentences files contain one sentence per line. Each line has the format: target term, `\t`, sentence, `\n`. Within the sentence, the target term is marked with the suffix `__xxNN`, where `NN` is the number of the sample. For each target term there are five samples, each containing between N/5 and 10 non-overlapping random sentences, where N is the total number of sentences containing the target term.
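As a rough sketch of how a sentences file could be read under the format above (the function name and the grouping by sample number are illustrative, not part of the dataset):

```python
import re
from collections import defaultdict

def read_sentence_samples(path):
    """Parse a Wiki-rnd sentences file: one '<target>\\t<sentence>' pair
    per line, with the target marked inside the sentence as
    '<target>__xxNN', where NN is the sample number."""
    # samples[target][sample_number] -> list of sentences
    samples = defaultdict(lambda: defaultdict(list))
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            target, sentence = line.split("\t", 1)
            # Recover the sample number NN from the marked target token.
            m = re.search(re.escape(target) + r"__xx(\d+)", sentence)
            sample_no = int(m.group(1)) if m else None
            samples[target][sample_no].append(sentence)
    return samples
```

This keeps the `__xxNN` marker in place so the marked occurrence of the target can still be located in each sentence.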

