I have a corpus of 2550 Chinese files, each of which contains only about 5+ words, for example: 补天 济时 , 勿 认真 作 常言 。
If I don't set sampling, having so many files makes the labels stick together, as in the screenshot below:
If I set sampling as below:
and other settings are as below:
It always reports the error below (even if I set the sample size to 1):
> stylo()
using current directory...
Performing sampling (using sample size = 5 words)
slicing input text into tokens...
Error in make.samples(loaded.corpus, sample.size, sampling, sample.overlap) :
Corpus error...
In addition: Warning message:
In make.samples(loaded.corpus, sample.size, sampling, sample.overlap) :
补天济时勿认真作常言... This text is too short!
I noticed that in the error message the text 补天济时勿认真作常言 has the spaces from my original text removed.
I think the problem is that you're asking Stylo to split your texts into samples of 5 features each, but the selected feature seems to be word trigrams, which would exceed your texts' length. I'm not sure I'm right, but you could try the following code:
my_corpus <- load.corpus.and.parse(
  corpus.dir = "data/corpus",
  markup.type = "plain", # this obviously depends on your input text
  corpus.lang = "CJK",
  sample.size = 1, # this could be changed according to your needs
  sampling = "normal.sampling",
  features = "w",
  ngram.size = 1,
  preserve.case = FALSE
)
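If I'm not mistaken, the resulting object can then be passed directly to stylo() through its parsed.corpus argument, so the corpus is not reloaded from disk:

stylo(gui = FALSE, parsed.corpus = my_corpus)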
Another problem could be the tokenization. If the text is not tokenized correctly, you may want to define custom splitting rules using the function txt.to.words.ext, as in the sketch below.
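For example, a minimal sketch for checking the tokenization on your sample line (the whitespace-only splitting.rule here is just an illustration and would keep the punctuation marks as tokens):

library(stylo)

# default tokenization rules for Chinese/Japanese/Korean
txt.to.words.ext("补天 济时 , 勿 认真 作 常言 。", corpus.lang = "CJK")

# custom rule: split on whitespace only, preserving your pre-segmented words
txt.to.words.ext("补天 济时 , 勿 认真 作 常言 。", splitting.rule = "[[:space:]]+")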
But I'm not sure this would solve your problem, since sampling actually multiplies the number of texts. Maybe by sampling you meant something that lets you randomly test only half (or a third, etc.) of your corpus.
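If that's what you're after, one minimal sketch (assuming the same data/corpus directory as above) is to draw a random subset of the file names and pass it through the files argument of load.corpus.and.parse():

set.seed(42) # just for a reproducible draw
all.files <- list.files("data/corpus")
half.of.corpus <- sample(all.files, floor(length(all.files) / 2))

my_corpus <- load.corpus.and.parse(
  files = half.of.corpus,
  corpus.dir = "data/corpus",
  corpus.lang = "CJK"
)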