Ignore tokens matching character ranges #75

Open
voidless opened this issue May 14, 2021 · 4 comments

Comments

@voidless

voidless commented May 14, 2021

Hi!
Is it possible to add an option to ignore tokens by character range?
If a whole token matches a single ignored character set, it would be skipped. Mixed-script words would still be flagged, but text written entirely in a language with a different character set would be ignored.

We (unfortunately) write some comments and strings in Russian, and they trigger a Spellr warning almost every time.
Simple dictionary checking doesn't work well for languages with many grammatical cases (e.g. Russian, Hindi), because you have to add every case form of each word to validate properly, and I was unable to find such dictionaries.

@voidless
Author

I've found a Russian dictionary that includes case forms (35MB); it will work for our needs.

@robotdana
Owner

hi, did the dictionary you found solve your problem?
is it a public dictionary that i could link for others in the documentation?
how is the performance of spellr with a 35MB wordlist?

@robotdana
Owner

ignoring character ranges is interesting though, i'll look into that, because it's already a problem for chinese and other scripts that don't really use word breaks. it should be doable in the regex with [[:alpha:]](?<!\p{Cyrillic}) or similar, i'll have a think about how to get that from the config to the regexes.
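
A minimal sketch of the lookbehind idea, assuming the ignored script names arrive as a plain list of strings. The `ignored_scripts` list and `word_re` are hypothetical and do not reflect how Spellr builds its tokenizer regexes.

```ruby
# Hypothetical: script names as they might be read from a config file.
ignored_scripts = %w[Cyrillic Han]

# One negative lookbehind per script, placed right after the letter it checks,
# so each matched letter must not belong to any ignored script.
exclusions = ignored_scripts.map { |name| "(?<!\\p{#{name}})" }.join
word_re = Regexp.new("(?:[[:alpha:]]#{exclusions})+")

tokens = "fix опечатка in README".scan(word_re)
p tokens  # => ["fix", "in", "README"]
```

The lookbehind runs after `[[:alpha:]]` has consumed a character, so it inspects that same character. Letters from the listed scripts never start or extend a token, and the Cyrillic word is simply not tokenized.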

@voidless
Author

I've used the dictionary from this repo: https://github.com/danakt/russian-words
The 35MB size is in Unicode; the original file was half that size in cp1251 encoding.
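
The size difference is what one would expect if the Unicode file is UTF-8 (an assumption; the comment only says "unicode"): cp1251 stores each Cyrillic letter in one byte, while UTF-8 needs two. A small sketch, with `cp1251_to_utf8` as a hypothetical helper name:

```ruby
# Re-encode cp1251 text as UTF-8 using String#encode(destination, source).
def cp1251_to_utf8(text)
  text.encode("UTF-8", "Windows-1251")
end

word_utf8   = "словарь"                         # 7 Cyrillic letters
word_cp1251 = word_utf8.encode("Windows-1251")  # same word in cp1251

puts word_cp1251.bytesize                       # 7  (one byte per letter)
puts word_utf8.bytesize                         # 14 (two bytes per letter)
puts cp1251_to_utf8(word_cp1251) == word_utf8   # true
```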

Spellr completes in around 4 seconds for 650k lines of code on my 6-core MacBook.

We are very happy with the results: we now spend less time on trivial errors during code review.
We even found a few errors in our localization files.
