no-invalid-regex rule is probably broken #594

Open
bartlomieju opened this issue Jan 17, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@bartlomieju
Member

bartlomieju commented Jan 17, 2021

Reported by @RDambrosio016 on Discord:

I think this test is incorrect:
assert_eq!(validator.validate_pattern("[\\c0-�]", false), Ok(()));

That range is out of order.
The odd thing is that V8 accepts it, but the other tools I tried don't.
I think there's something weird going on with a test in dlint's regex validator tests: [🌷-🌸] is checked for being invalid, but running dlint on a file that contains it doesn't yield any errors. I still don't know why V8 does not accept it. I think it's something to do with UTF-16 code units, because the range is valid when compared by code point.
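A minimal sketch of why the emoji range can look out of order, assuming the validator compares UTF-16 code units when the /u flag is absent. The helpers `utf16_units` and `range_in_order` are hypothetical names for illustration and are not part of dlint:

```rust
/// Returns the UTF-16 code units of a single char.
/// (Hypothetical helper, not dlint's API.)
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

/// A class range `lo-hi` is in order when lo <= hi.
/// (Hypothetical helper, not dlint's API.)
fn range_in_order(lo: u32, hi: u32) -> bool {
    lo <= hi
}

fn main() {
    let tulip = '\u{1F337}'; // 🌷
    let blossom = '\u{1F338}'; // 🌸

    // Both chars are outside the BMP, so each is a surrogate pair in UTF-16.
    assert_eq!(utf16_units(tulip), vec![0xD83C, 0xDF37]);
    assert_eq!(utf16_units(blossom), vec![0xD83C, 0xDF38]);

    // With /u the class is one range, U+1F337-U+1F338, which is in order.
    assert!(range_in_order(tulip as u32, blossom as u32));

    // Without /u the class is read as: D83C, then the range DF37-D83C, then DF38.
    // The range runs from the tulip's low surrogate to the blossom's high surrogate.
    let lo = utf16_units(tulip)[1] as u32;
    let hi = utf16_units(blossom)[0] as u32;
    assert!(!range_in_order(lo, hi));
}
```

Under that assumption, the same source text is a valid class with /u and an out-of-order range without it.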
@bartlomieju bartlomieju added the bug Something isn't working label Jan 17, 2021
@bartlomieju
Member Author

Further investigation by @RDambrosio016 https://discord.com/channels/684898665143206084/775366479143108608/800774226894258207

@bartlomieju Yeah, according to the spec, if the regex doesn't have /u then the chars are UTF-16 code units; if it's parsed with /u then they are full Unicode code points (Rust chars).
I don't think it's very hard to fix the validator to treat code points correctly.
As far as I know, it should be fine to just call encode_utf16() on the char and, if that gives multiple code units, yield the first one.
Although UTF-8 makes this... weird,
because I don't think it's possible in UTF-8 to advance partway over a char that spans multiple code units without landing off a char boundary. :sweating:
I think for now I'm going to just keep 32-bit code points, since:
- 16-bit code units are hard to get working correctly
- most people don't put chars that span multiple code units in their regexes
- it makes error reporting easier for rules like no-misleading-character-class

If that's fine with you.
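One way to sidestep the UTF-8 boundary problem described above is to expand the pattern into a list of units up front instead of advancing through the string. This is only a sketch under that assumption, and it differs from the "yield the first unit" idea in the comment. `pattern_units` is a hypothetical helper, not dlint's actual reader:

```rust
/// Expands a pattern into the units a validator would compare.
/// With `unicode` (the /u flag) each unit is a whole code point;
/// without it each unit is a UTF-16 code unit.
/// (Hypothetical helper, not dlint's API.)
fn pattern_units(pattern: &str, unicode: bool) -> Vec<u32> {
    if unicode {
        pattern.chars().map(|c| c as u32).collect()
    } else {
        pattern.encode_utf16().map(u32::from).collect()
    }
}

fn main() {
    let pattern = "[\u{1F337}-\u{1F338}]"; // [🌷-🌸]

    // /u: five units, the emoji stay whole.
    assert_eq!(
        pattern_units(pattern, true),
        vec![0x5B, 0x1F337, 0x2D, 0x1F338, 0x5D]
    );

    // No /u: seven units, each emoji becomes a surrogate pair.
    assert_eq!(
        pattern_units(pattern, false),
        vec![0x5B, 0xD83C, 0xDF37, 0x2D, 0xD83C, 0xDF38, 0x5D]
    );
}
```

The trade-off is that positions in the unit list no longer match byte offsets in the source string, so error reporting would need a separate mapping back to the original text.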
