-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bugfixes for UTF support (WIP) #226
base: master
Are you sure you want to change the base?
Conversation
Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so it can be used when characters with those codes are used even when not escaped. This also resolves the ambuiguity that allowed the following: PCRE2 version 10.34 2019-11-21 re> /\p{Cs}/match_invalid_utf data> \x{dfff}\=no_utf_check No match Several interrelated bug fixes: * An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be allowed with PCRE2_EXTRA_ALLOW_SURROGATES. Before this change non UTF-16 will not allow surrogates if encoded in the pattern directly. * In ALT_BSUX mode, any escaped character should represent itself if not "special", but that was not the case for not ASCII characters. * When match_invalid_utf is enabled a surrogate character in the subject was being skipped, even if no_utf_check was used together with allow_surrogates Issues that would be addressed with a new version * the documentation around \o and \x, specially when bigger than 255 needs reviewing for clarification. * pcre2test support for testing escaped UTF characters is suboptimal, so most testing was done with custom code that should be integrated instead. * JIT is broken when match_invalid_utf and allow_surrogate. Issues that are still open and had been punted from this version: * ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF. * PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might mean (\u00F1, assume binary matching like it seems to do with surrogates and match with an extended 0xf1 PCRE2_UCHAR, or error), if the later is something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?. * PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates. * PCRE2_EXTRA_ALT_BSUX should restrict the use of \, so for example non ASCII or \U throw an error. It probably need to break with PCRE2_ALT_SUX. * Behaviour is different between UTF-16 and UTF-32 but tests are still unified so tests are missing.
\x{09f} | ||
No match | ||
|
||
/^\p{Cs}/match_invalid_utf,allow_surrogates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this replicates the behaviour where utf is used instead of match_invalid_utf, by using the "allow_surrogates" to redefine what is valid, but maybe it shouldn't match instead since it is invalid UTF-16?
@@ -146,7 +146,7 @@ D is inspected during pcre2_dfa_match() execution | |||
|
|||
/* An additional compile options word is available in the compile context. */ | |||
|
|||
#define PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u /* C */ | |||
#define PCRE2_EXTRA_ALLOW_SURROGATES 0x00000001u /* C */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to keep the old flag and add PCRE2_EXTRA_ALLOW_SURROGATES, which will imply it and extend it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was already going to say that making the non-compatible change of removing an option is not something I'd want to do, especially as there is a way round it, as you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and I note that the GitHub tests have failed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, I posted this as a work in progress, to help gather feedback on the approach and caveats, and a roundabout way to ask Zoltan for help implementing the new behaviour (use a flag to redefine what is invalid utf), so that match_invalid_utf and utf behave similarly and resolve the ambiguity when matching \p{Cs} in the JIT, if it was sound and correct for the interpreter.
Will update the change and make the RFC more explicit.
Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so it can be used when characters with those codes are used even when not escaped. This also resolves the ambuiguity that allowed the following:
PCRE2 version 10.34 2019-11-21
re> /\p{Cs}/match_invalid_utf
data> \x{dfff}=no_utf_check
No match
Several interrelated bug fixes:
Issues that would be addressed with a new version
Issues that are still open and had been punted from this version: