As I read it, the `TextReuseCorpus()` function has a safety check so that tokenizers are not run on documents that are too short, where "too short" means too small to generate two ngrams of the requested size. In addition, the tokenizers seem to have their own assertions to guard against overly short documents.
However, I have run into problems with skipgrams. First, the safety check in `TextReuseCorpus()` lets documents through that the assertion in `tokenize_skip_ngrams()` then bails out on, because that assertion assumes a larger minimum document length. Second, I don't quite understand why the assertion requires this in the first place. IIUC, it is `n + n * k - k <= length(words)`, but why can't I generate skipgrams from a document of the same minimum length that the ngram tokenizer accepts (`n < length(words)`)?
FWIW, I am trying to build large skipgrams, say with `n = 15` and `k = 3`. The assertion in question is in `textreuse/R/tokenizers.R`, line 59 in 35f8421. Thanks for any pointers or insights.
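For concreteness, here is a minimal sketch of the mismatch (the repeated filler word and the counts are just illustrative):

```r
library(textreuse)

# 20 words is enough for two 15-grams (n + 1 = 16 words), so the
# short-document check in TextReuseCorpus() would let this through...
doc <- paste(rep("word", 20), collapse = " ")

# ...but tokenize_skip_ngrams() asserts n + n * k - k <= length(words),
# which for n = 15, k = 3 works out to 15 + 15 * 3 - 3 = 57 words,
# so this call fails the assertion:
tokenize_skip_ngrams(doc, n = 15, k = 3)
```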
Have you tried the skip-gram tokenizer in the tokenizers package? Those tokenizers will eventually replace the ones in this package. Note that their output format is somewhat different, so you will have to pass the `simplify = TRUE` argument.
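For example, something along these lines:

```r
library(tokenizers)

# tokenizers returns a list with one element per input document;
# simplify = TRUE unwraps the single-document case to a plain character
# vector, which is the format the textreuse tokenizers return.
tokenize_skip_ngrams("your document text goes here", n = 15, k = 3,
                     simplify = TRUE)
```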
In general, this package is intended to let you drop in different tokenizers, so if the existing ones do not meet your needs, you might consider writing a special-case tokenizer of your own.
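A rough sketch of what that could look like; the wrapper name and directory are placeholders, and this assumes a tokenizer only needs to take a string and return a character vector of tokens:

```r
library(textreuse)

# Hypothetical wrapper around the tokenizers package that matches the
# interface textreuse expects: string in, character vector of tokens out.
my_skip_ngrams <- function(string, n = 15, k = 3) {
  tokenizers::tokenize_skip_ngrams(string, n = n, k = k, simplify = TRUE)
}

# "path/to/documents" is a placeholder directory.
corpus <- TextReuseCorpus(dir = "path/to/documents",
                          tokenizer = my_skip_ngrams)
```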