
How to set sampling for lots of files with short texts? #60

Open
fishfree opened this issue Dec 9, 2024 · 1 comment

Comments

@fishfree

fishfree commented Dec 9, 2024

I have a corpus of 2550 Chinese files, each containing only about 5 words, for example: 补天 济时 , 勿 认真 作 常言 。
If I don't set sampling, with so many files the labels stick together, as in the screenshot below:
[screenshot]

If I set sampling as below:
[screenshot]
and the other settings as below:
[screenshots]
it always reports the error below (even if I set the sample size to 1):

> stylo()
using current directory...
Performing sampling (using sample size = 5 words)

slicing input text into tokens...

Error in make.samples(loaded.corpus, sample.size, sampling, sample.overlap) : 
  Corpus error...
In addition: Warning message:
In make.samples(loaded.corpus, sample.size, sampling, sample.overlap) : 

补天济时勿认真作常言...	This text is too short!

I noticed that in the error message the text 补天济时勿认真作常言 has the spaces from my original text removed.

@ndrscalia

I think the problem is that you're asking stylo to split your texts into samples of 5 features each, but the selected feature seems to be word trigrams, which will exceed your texts' length. I'm not sure I'm right, but you could try the following code:

my_corpus <- load.corpus.and.parse(
  corpus.dir = "data/corpus",
  markup.type = "plain",          # plain-text input; this depends on your files
  corpus.lang = "CJK",            # stylo's splitting rules for Chinese/Japanese/Korean
  sample.size = 1,                # change this according to your needs
  sampling = "normal.sampling",
  features = "w",                 # single words rather than word n-grams
  ngram.size = 1,
  preserve.case = FALSE
)
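
If the call succeeds, my_corpus is a list of tokenized samples, so you can sanity-check the result with length(my_corpus) and by printing the first element, my_corpus[[1]].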

Another problem could be the tokenization. If the text is not tokenized correctly, you may want to tweak the splitting rules with the function txt.to.words.ext; see the sketch below.
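For example, something like this (untested; I'm assuming here that stylo's CJK rule splits running text into single characters, which would also explain why the spaces disappear in your error message, and that splitting.rule takes a regular expression matching the delimiters):

library(stylo)

# one line from your corpus, with the original spaces kept
test_line <- "补天 济时 , 勿 认真 作 常言 。"

# default behaviour for the CJK setting
txt.to.words.ext(test_line, corpus.lang = "CJK")

# split on whitespace only, keeping your pre-segmented words intact
txt.to.words(test_line, splitting.rule = "[[:space:]]+")

If that gives you the tokens you expect, the same splitting.rule argument can, if I remember correctly, also be passed to load.corpus.and.parse.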
But I'm not sure this would solve your problem, since sampling actually multiplies the number of texts. Perhaps by "sampling" you meant something that lets you randomly test only half (or a third, etc.) of your corpus.
