Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ #71 ] warn about nullable regexs in the absence of start codes #155

Merged
merged 2 commits into from
Feb 24, 2020

Conversation

andreasabel
Copy link
Member

Changed alex_scan_tkn to return AlexNone instead of AlexLastAcc or
AlexLastSkip if length of newly processed input is still 0.

This fixes the problem that Alex would produce an infinite sequence of
tokens of length 0 when the DFA was stuck in a state accepting a
nullable token.

Now it will either produce EOF (if at end of input) or a lexer error (if
not at the end of input). This is the intuitively expected behavior.

@andreasabel andreasabel requested a review from simonmar January 26, 2020 20:13
@andreasabel
Copy link
Member Author

This fix requires the length to be computed correctly, see #119 (PR #156).

@simonmar
Copy link
Member

If you define a token that can have length zero, returning an infinite sequence of those at the end of the input is correct, no? Shouldn't we just emit a warning and do what the user asked for?

@andreasabel
Copy link
Member Author

Since Alex rejected nullable tokens before by simply crashing, this is the first time that nullable tokens are considered, and we are free to define their semantics. (Also, there is no need to rush the issue.)

My proposal is that all the tokens Alex ever produces will match at least 1 character of the input. I think this the intuitive expectation simply because the alternative, returning tokens that do not advance the input is useless. Unlike in parsers, where some non-terminals match the empty input, there is no way to request from the lexer "I would need this null token now, can you deliver it?" The lexer works in an autonomous way, giving the token that matches the longest piece of input; thus, it will only return a null token if nothing else is possible, ie, if it is stuck. And then, infinitely many null tokens. I cannot imagine a situation where this is useful behavior, can you?

On the contrary, allowing Alex to return null tokens destroys the following properties:

  1. The length of the output is bounded by the length of the input.
  2. This holds also for any prefix of the input (provided output is produced).
  3. The lexer terminates on finite input.

Additionally to these conceptual arguments, one could make some field studies of what other lexers produce in this situation. I'd be surprised if any lexer out there produces an infinite stream of nullable tokens, and my initial experiments using BNFC generated parsers do not suggest so. The rule (ocamllex, flex, JLex) seems to give parse errors, ANTLR is the outlier with delivering an imho incorrect positive result (a token containing the string ""). But to be absolutely sure, I would have to tighten my tests to involve lexers only.

@simonmar
Copy link
Member

I'm fine with requiring all tokens to match at least one byte/character, but then we should make it a static error for there to be nullable tokens, rather than just ignoring them.

@andreasabel
Copy link
Member Author

we should make it a static error for there to be nullable tokens, rather than just ignoring them.

For consideration, the non-nullable version of

  a*b*c*d*

is

  a+b*c*d* | a*b+c*d* | a*b*c+d* | a*b*c*d+

which is a linear blowup.

Insisting on non-nullability of the RE rather than just to actual tokens produced that match the RE could place an unnecessary burden on the user. I observed ANTLR warning about nullable REs, I haven't observed any lexer generator giving an outright error.

@simonmar
Copy link
Member

That makes sense. Can we give a warning then?

@andreasabel
Copy link
Member Author

Yes, I think giving warnings is the appropriate thing.
I am currently a bit laden, I'll look into this when I cleared some things up.

@andreasabel
Copy link
Member Author

andreasabel commented Jan 31, 2020

[ #71 ] Emit warnings for nullable tokens.

The parser for .x files will now generate warnings for rules whose regular expression is nullable, i.e., matches the empty input.

Alex does not emit tokens that consume no input, thus, a warning is appropriate.

Warnings are kept as a stack in the state of the parse monad, and printed upon successful completion of the parser in Main.

Example (excerpt of issue_71.x):

  $whitespace  = [\ \n\t]
  @whitespaces = $whitespace*

  :-

  @whitespaces { \ _ _ -> Whitespaces }
               ^ Warning here!

Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string, but will be interpreted as not matching the empty string.

Since the parser does not generate abstract syntax with position information, it is hard to give the exact warning location, i.e., the location of the offending regular expression. To keep the changes minimal, we record the location of the token after the regular expression, i.e., the location where the code part begins. This approximate location should be good enough to jump to the regex in question.

Another problem is that the exact text of the regular expression is not printed in the warning, only what Show RExp gives us. This could be fixed if we had the exact location information of the regular expression; we could then cut the regular expression out of the input string. Alternatively, the parser could be modified to return a concrete syntax for the regular expression from which its original text could be recovered. The abstract reg.ex. would then be built from the concrete one in a second step.

At this point, I abstain from such invasive changes to Alex for the sake of improving this rare warning.

Bikeshedding on the warning text welcome:

Regular expression [..]* matches the empty string, but will be interpreted as not matching the empty string.

@andreasabel
Copy link
Member Author

andreasabel commented Jan 31, 2020

I just saw the empty token in Alex' own lexer:
https://github.com/simonmar/alex/blob/6c4db72c3bb3419c84740e2e9ae112ed0abd572e/src/Scan.x#L73-L76
Does my PR break the logic implemented here? (Having a bad feeling about it, since I had some spooky boot-strapping failures for Alex now and then. Travis might actually not test if the just generated Alex can be used to bootstrap Alex; it bootstraps with an released Alex only.)
I think skip needs to be replaced by a null action that simply does nothing.

UPDATE: I think I panicked for no reason. Ignore my comment. I did not understand what skip is doing.

@andreasabel
Copy link
Member Author

Please do not merge yet.

First, I want to make sure travis also tests bootstrapping with the just generated Alex.

@simonmar
Copy link
Member

simonmar commented Feb 6, 2020

So what happens with the afterstartcodes token in Alex now? Does it never match the empty stream? Wouldn't that break the logic?

@andreasabel
Copy link
Member Author

So what happens with the afterstartcodes token in Alex now? Does it never match the empty stream? Wouldn't that break the logic?

Yeah, that is what I am worried about, but I haven't reached full understanding yet.

@andreasabel
Copy link
Member Author

It seems that my reasoning about nullable tokens is incorrect once we have startcodes that change upon receiving an empty token. Since this changes the state of the lexer, producing one empty token does not mean that from then on, only empty tokens will be produced.

Thus, it is also questionable whether a warning should be emitted. Maybe a warning could be emitted unless startcodes are involved? (However, the motivation to act any further on issue #71 has diminished with the observation that empty tokens do serve a role in the lexer.)

The question is also what the user should do about the warning. It could be ignored, but some like warning-freeness. Then, how to get rid of the warning, e.g. in case one uses a regex like

  a*b*c*d*

The most convenient way would be a regex-operator that removes the empty word. I have seen such an operator used in theory, but in practice, it does not seem standard.
Well, maybe the user that likes warning-freeness has to bite the bullet and manually unfold to the lengthy

a+b*c*d* | a*b+c*d* | a*b*c+d* | a*b*c*d+

So, the conclusion of my musings: What could be salvaged from this PR?

  • Mechanism to issue a warning on nullable tokens if no startcodes are used.
  • Typos in comments I fixed.

@simonmar
Copy link
Member

I imagine the warning would only make sense if we can prove that the lexer will produce a continuous stream of empty tokens at some point, i.e. it won't terminate. It's not clear to me exactly what conditions are needed for this though.

@andreasabel andreasabel changed the title [ fixed #71 ] do not produce tokens of length 0 [ #71 ] warn about nullable regexs in the absence of start codes Feb 13, 2020
@andreasabel
Copy link
Member Author

Mechanism to issue a warning on nullable tokens if no startcodes are used.

This is what is implemented now. The warning is:

Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string.

I imagine the warning would only make sense if we can prove that the lexer will produce a continuous stream of empty tokens at some point, i.e. it won't terminate.

Indeed.

It's not clear to me exactly what conditions are needed for this though.

If the lexer does not use start codes, it state does not change when emitting a null token, since it does not advance the input. Thus, it will make the same decision and again emit a null token, and again and again...

The parser for .x files will now generate warnings for rules whose
regular expression is nullable, i.e., matches the empty input.

Exceptions are:

1. The user wrote regular expression ().
   In this case they probably meant to match the empty input.

2. Startcodes are involved.
   Then, matching the empty input might still change the lexer
   state, thus, the lexer might not continue to match the empty input.

Warnings are kept as a stack in the state of the parse monad, and
printed upon successful completion of the parser in Main.

Example (excerpt of `issue_71.x`):
```
  $whitespace  = [\ \n\t]
  @whitespaces = $whitespace*

  :-

  @whitespaces { \ _ _ -> Whitespaces }
               ^ Warning here!

Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string.
```

Since the parser does not generate abstract syntax with position
information, it is hard to give the exact warning location, i.e., the
location of the offending regular expression.  To keep the changes
minimal, we record the location of the token _after_ the regular
expression, i.e., the location where the code part begins.  This
approximate location should be good enough to jump to the regex in
question.

Another problem is that the exact text of the regular expression is not
printed in the warning, only what `Show RExp` gives us.  This could be
fixed if we had the exact location information of the regular
expression; we could then cut the regular expression out of the input
string.  Alternatively, the parser could be modified to return a
_concrete_ syntax for the regular expression from which its original
text could be recovered.  The abstract reg.ex. would then be built from
the concrete one in a second step.

At this point, I abstain from such invasive changes to Alex for the sake
of improving this rare warning.
@simonmar simonmar merged commit 60d5932 into haskell:master Feb 24, 2020
@simonmar
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants