You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)
When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.
When an optional subexpression's final term has an "X-or-more" modifier (? or *), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead, the modifier is applied to the portion of the subexpression before the final term.
Consider the following regular expression: ((##[^\r\n]*)?\r?\n)+
This contains the optional subexpression: (##[^\r\n]*)?
This subexpression should match:
The empty string
Two consecutive # symbols followed by zero or more characters that are not CR or LF.
When the ParseFact parses the subexpression, it calls ParseAtom, which returns an NFA of the form:
Then, the '?' is read, and ParseFact calls ParseAtomModifier to rewrite the NFA. ParseAtomModifier computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:
Because there are outgoing transitions from S2, this changes the expression to (##)?[^\r\n]*, which matches:
The empty string
Any number of non-CR/LF characters
(As an aside, this turns my regex into ((##)?[^\r\n]*\r?\n)+, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that's going to be pretty much any real string input, the regex matched the entire input, which produced some very confusing error messages.)
The fix for this is simple: in ParseAtomModifier, before the "handle supported repeaters" comment:
Two consecutive '#' characters followed by any number of non-CR/LF characters.
Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as patch to fix several other minor errors in the Java->C# translation code.
The text was updated successfully, but these errors were encountered:
Fantastic bug report! Thank you for taking the time to detail the issue at such length. Amazing!
The C# and Java runtimes are manually kept in sync, where it makes sense. But the C# version has a slightly nicer API, as it supports properties (that Java didn't at the time). If you provide a patch for either one or both, it would be great!
(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)
When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.
When an optional subexpression's final term has an "X-or-more" modifier (
?
or*
), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead, the modifier is applied to the portion of the subexpression before the final term.Consider the following regular expression:
((##[^\r\n]*)?\r?\n)+
This contains the optional subexpression:
(##[^\r\n]*)?
This subexpression should match:
#
symbols followed by zero or more characters that are not CR or LF.When the
ParseFact
parses the subexpression, it callsParseAtom
, which returns an NFA of the form:Then, the '?' is read, and
ParseFact
callsParseAtomModifier
to rewrite the NFA.ParseAtomModifier
computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:Because there are outgoing transitions from S2, this changes the expression to
(##)?[^\r\n]*
, which matches:(As an aside, this turns my regex into
((##)?[^\r\n]*\r?\n)+
, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that's going to be pretty much any real string input, the regex matched the entire input, which produced some very confusing error messages.)The fix for this is simple: in
ParseAtomModifier
, before the "handle supported repeaters" comment:This causes the method to produce the following NFA:
This will correctly match:
Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as patch to fix several other minor errors in the Java->C# translation code.
The text was updated successfully, but these errors were encountered: