Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular expressions are evaluated incorrectly when an optional subexpression ends in a + or * #10

Open
Stevie-O opened this issue Aug 2, 2019 · 1 comment
Labels

Comments

@Stevie-O
Copy link

Stevie-O commented Aug 2, 2019

(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)

When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.

When an optional subexpression's final term has an "X-or-more" modifier (? or *), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead, the modifier is applied to the portion of the subexpression before the final term.

Consider the following regular expression:
((##[^\r\n]*)?\r?\n)+
This contains the optional subexpression:
(##[^\r\n]*)?

This subexpression should match:

  • The empty string
  • Two consecutive # symbols followed by zero or more characters that are not CR or LF.

When the ParseFact parses the subexpression, it calls ParseAtom, which returns an NFA of the form:

S0:
  '#' -> S1
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Then, the '?' is read, and ParseFact calls ParseAtomModifier to rewrite the NFA.
ParseAtomModifier computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:

S0:
  '#' -> S1
  nil -> S2
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Because there are outgoing transitions from S2, this changes the expression to (##)?[^\r\n]*, which matches:

  • The empty string
  • Any number of non-CR/LF characters

(As an aside, this turns my regex into ((##)?[^\r\n]*\r?\n)+, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that's going to be pretty much any real string input, the regex matched the entire input, which produced some very confusing error messages.)

The fix for this is simple: in ParseAtomModifier, before the "handle supported repeaters" comment:

if (end.outgoing.Length > 0) {
   end = end.AddOut(new NFAEpsilonTransition(end));
}

This causes the method to produce the following NFA:

S0:
  '#' -> S1
  nil -> S3
S1:
  '#' -> S2
S2:
  [^\r\n] -> S2
  nil -> S3
S3: (accept/end state)

This will correctly match:

  • The empty string
  • Two consecutive '#' characters followed by any number of non-CR/LF characters.

Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as patch to fix several other minor errors in the Java->C# translation code.

@cederberg
Copy link
Owner

Fantastic bug report! Thank you for taking the time to detail the issue at such length. Amazing!

The C# and Java runtimes are manually kept in sync, where it makes sense. But the C# version has a slightly nicer API, as it supports properties (that Java didn't at the time). If you provide a patch for either one or both, it would be great!

@cederberg cederberg added the bug label Aug 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants