How to set sampling for lots of files with short texts? #60

Open
fishfree opened this issue Dec 9, 2024 · 0 comments

fishfree commented Dec 9, 2024

I have a corpus of 2,550 Chinese files, each containing only about five or more words, for example: 补天 济时 , 勿 认真 作 常言 。
If I don't set sampling, the many files stick together in the output, as in the screenshot below:
image

If I set sampling as below:
image
and other settings are as below:
image
image
image
It always reports the error below (even when I set the sample size to 1):

> stylo()
using current directory...
Performing sampling (using sample size = 5 words)

slicing input text into tokens...

Error in make.samples(loaded.corpus, sample.size, sampling, sample.overlap) : 
  Corpus error...
In addition: Warning message:
In make.samples(loaded.corpus, sample.size, sampling, sample.overlap) : 

补天济时勿认真作常言...	This text is too short!

I noticed that the text 补天济时勿认真作常言 in the error message has lost the spaces that were in the text I provided.
