Ignore tokens matching character ranges #75

voidless · 2021-05-14T14:08:44Z

Hi!
Is it possible to add an option to ignore character ranges for tokens?
If a whole token falls within one ignored character set, it would be skipped. This would still catch words that mix languages, while ignoring languages written in a different character set.

We (unfortunately) write some comments and strings in Russian, and this triggers a Spellr warning almost every time.
Simple dictionary checking doesn't work well with languages that have many cases (e.g. Russian, Hindi), because you have to add every case form of each word to validate properly, and I was unable to find such dictionaries.

voidless · 2021-05-14T14:41:03Z

I've found a Russian dictionary that includes the case forms (35MB); it will work for our case.

robotdana · 2021-06-12T08:24:13Z

Hi, did the dictionary you found solve your problem?
Is it a public dictionary that I could link for others in the documentation?
How is the performance of Spellr with a 35MB wordlist?

robotdana · 2021-06-12T08:33:35Z

The idea of ignoring character ranges is interesting though. I'll look into that, because it's already a problem for Chinese and other scripts that don't really use word breaks. It should be doable in the regex with ([[:alpha:]](?<!\p{Cyrillic}) or similar. I'll have a think about how to get that from the config to the regexes.

voidless · 2021-06-15T11:32:43Z

I used the dictionary from this repo: https://github.com/danakt/russian-words
The 35MB size is for the Unicode version; the original file, in cp1251 encoding, was half that size.
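
The size difference comes from the encodings: cp1251 stores each Cyrillic letter in one byte, while UTF-8 needs two. A small sketch of that arithmetic and of one possible conversion follows, assuming the Unicode version is UTF-8. The method name `convert_wordlist` and any paths passed to it are hypothetical.

```ruby
word = "привет"                         # 6 Cyrillic letters, UTF-8 source
cp1251 = word.encode("Windows-1251")    # same text, cp1251 bytes

utf8_size = word.bytesize      # => 12 (two bytes per letter)
cp1251_size = cp1251.bytesize  # => 6  (one byte per letter)

# Hypothetical helper: read a cp1251 wordlist and write it back as UTF-8.
def convert_wordlist(source_path, target_path)
  text = File.read(source_path, encoding: "Windows-1251:UTF-8")
  File.write(target_path, text)
end
```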

Spellr completes in around 4 seconds for 650k lines of code on my 6-core MacBook.

We are very happy with the results. We now spend less time on trivial errors during code review.
We even found a few errors in our localization files.
