
How to set sampling for lots of files with short texts? #60

Open
fishfree opened this issue Dec 9, 2024 · 1 comment

Comments

@fishfree

fishfree commented Dec 9, 2024

I have a corpus of 2550 Chinese files, each containing only about 5 words, for example: 补天 济时 , 勿 认真 作 常言 。
If I don't set sampling, with so many files the labels stick together, as in the screenshot below:
[screenshot]

If I set sampling as below:
[screenshot]
and the other settings as below:
[screenshots]
it always reports the error below (even if I set the sample size to 1):

> stylo()
using current directory...
Performing sampling (using sample size = 5 words)

slicing input text into tokens...

Error in make.samples(loaded.corpus, sample.size, sampling, sample.overlap) : 
  Corpus error...
In addition: Warning message:
In make.samples(loaded.corpus, sample.size, sampling, sample.overlap) : 

补天济时勿认真作常言...	This text is too short!

I noticed that in the error message the text 补天济时勿认真作常言 has the spaces from my original text removed.

@ndrscalia

I think the problem is that you're asking stylo to split your texts into samples of 5 features each, but the selected feature seems to be word trigrams, which will exceed your texts' length. I'm not sure I'm right, but you could try the following code:

my_corpus <- load.corpus.and.parse(
  corpus.dir = "data/corpus",
  markup.type = "plain",          # plain-text input; this depends on your files
  corpus.lang = "CJK",            # stylo's splitting rules for Chinese/Japanese/Korean
  sample.size = 1,                # change this according to your needs
  sampling = "normal.sampling",
  features = "w",                 # single words rather than word n-grams
  ngram.size = 1,
  preserve.case = FALSE
)
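
If the call succeeds, my_corpus is a list of tokenized samples, so you can sanity-check the result with length(my_corpus) and by printing the first element, my_corpus[[1]].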

Another problem could be the tokenization. If the text is not tokenized correctly, you may want to tweak the splitting rules with the function txt.to.words.ext; see the sketch below.
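For example, something like this (untested; I'm assuming here that stylo's CJK rule splits running text into single characters, which would also explain why the spaces disappear in your error message, and that splitting.rule takes a regular expression matching the delimiters):

library(stylo)

# one line from your corpus, with the original spaces kept
test_line <- "补天 济时 , 勿 认真 作 常言 。"

# default behaviour for the CJK setting
txt.to.words.ext(test_line, corpus.lang = "CJK")

# split on whitespace only, keeping your pre-segmented words intact
txt.to.words(test_line, splitting.rule = "[[:space:]]+")

If that gives you the tokens you expect, the same splitting.rule argument can, if I remember correctly, also be passed to load.corpus.and.parse.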
But I'm not sure this would solve your problem, since sampling actually multiplies the number of texts. Perhaps by "sampling" you meant something that lets you randomly test only half (or a third, etc.) of your corpus.
