[ #71 ] warn about nullable regexs in the absence of start codes #155

andreasabel · 2020-01-26T09:18:03Z

Changed alex_scan_tkn to return AlexNone instead of AlexLastAcc or
AlexLastSkip if length of newly processed input is still 0.

This fixes the problem that Alex would produce an infinite sequence of
tokens of length 0 when the DFA was stuck in a state accepting a
nullable token.

Now it will either produce EOF (if at end of input) or a lexer error (if
not at the end of input). This is the intuitively expected behavior.

andreasabel · 2020-01-26T20:14:46Z

This fix requires the length to be computed correctly, see #119 (PR #156).

simonmar · 2020-01-27T08:17:47Z

If you define a token that can have length zero, returning an infinite sequence of those at the end of the input is correct, no? Shouldn't we just emit a warning and do what the user asked for?

andreasabel · 2020-01-30T08:01:49Z

Since Alex rejected nullable tokens before by simply crashing, this is the first time that nullable tokens are considered, and we are free to define their semantics. (Also, there is no need to rush the issue.)

My proposal is that all the tokens Alex ever produces will match at least 1 character of the input. I think this the intuitive expectation simply because the alternative, returning tokens that do not advance the input is useless. Unlike in parsers, where some non-terminals match the empty input, there is no way to request from the lexer "I would need this null token now, can you deliver it?" The lexer works in an autonomous way, giving the token that matches the longest piece of input; thus, it will only return a null token if nothing else is possible, ie, if it is stuck. And then, infinitely many null tokens. I cannot imagine a situation where this is useful behavior, can you?

On the contrary, allowing Alex to return null tokens destroys the following properties:

The length of the output is bounded by the length of the input.
This holds also for any prefix of the input (provided output is produced).
The lexer terminates on finite input.

Additionally to these conceptual arguments, one could make some field studies of what other lexers produce in this situation. I'd be surprised if any lexer out there produces an infinite stream of nullable tokens, and my initial experiments using BNFC generated parsers do not suggest so. The rule (ocamllex, flex, JLex) seems to give parse errors, ANTLR is the outlier with delivering an imho incorrect positive result (a token containing the string ""). But to be absolutely sure, I would have to tighten my tests to involve lexers only.

simonmar · 2020-01-30T08:07:09Z

I'm fine with requiring all tokens to match at least one byte/character, but then we should make it a static error for there to be nullable tokens, rather than just ignoring them.

andreasabel · 2020-01-30T08:15:44Z

we should make it a static error for there to be nullable tokens, rather than just ignoring them.

For consideration, the non-nullable version of

  a*b*c*d*

is

  a+b*c*d* | a*b+c*d* | a*b*c+d* | a*b*c*d+

which is a linear blowup.

Insisting on non-nullability of the RE rather than just to actual tokens produced that match the RE could place an unnecessary burden on the user. I observed ANTLR warning about nullable REs, I haven't observed any lexer generator giving an outright error.

simonmar · 2020-01-30T08:41:04Z

That makes sense. Can we give a warning then?

andreasabel · 2020-01-30T08:50:28Z

Yes, I think giving warnings is the appropriate thing.
I am currently a bit laden, I'll look into this when I cleared some things up.

andreasabel · 2020-01-31T10:24:32Z

[ #71 ] Emit warnings for nullable tokens.

The parser for .x files will now generate warnings for rules whose regular expression is nullable, i.e., matches the empty input.

Alex does not emit tokens that consume no input, thus, a warning is appropriate.

Warnings are kept as a stack in the state of the parse monad, and printed upon successful completion of the parser in Main.

Example (excerpt of issue_71.x):

  $whitespace  = [\ \n\t]
  @whitespaces = $whitespace*

  :-

  @whitespaces { \ _ _ -> Whitespaces }
               ^ Warning here!

Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string, but will be interpreted as not matching the empty string.

Since the parser does not generate abstract syntax with position information, it is hard to give the exact warning location, i.e., the location of the offending regular expression. To keep the changes minimal, we record the location of the token after the regular expression, i.e., the location where the code part begins. This approximate location should be good enough to jump to the regex in question.

Another problem is that the exact text of the regular expression is not printed in the warning, only what Show RExp gives us. This could be fixed if we had the exact location information of the regular expression; we could then cut the regular expression out of the input string. Alternatively, the parser could be modified to return a concrete syntax for the regular expression from which its original text could be recovered. The abstract reg.ex. would then be built from the concrete one in a second step.

At this point, I abstain from such invasive changes to Alex for the sake of improving this rare warning.

Bikeshedding on the warning text welcome:

Regular expression [..]* matches the empty string, but will be interpreted as not matching the empty string.

andreasabel · 2020-01-31T17:46:16Z

I just saw the empty token in Alex' own lexer:
https://github.com/simonmar/alex/blob/6c4db72c3bb3419c84740e2e9ae112ed0abd572e/src/Scan.x#L73-L76
Does my PR break the logic implemented here? (Having a bad feeling about it, since I had some spooky boot-strapping failures for Alex now and then. Travis might actually not test if the just generated Alex can be used to bootstrap Alex; it bootstraps with an released Alex only.)
I think skip needs to be replaced by a null action that simply does nothing.

UPDATE: I think I panicked for no reason. Ignore my comment. I did not understand what skip is doing.

andreasabel · 2020-02-01T09:25:31Z

Please do not merge yet.

First, I want to make sure travis also tests bootstrapping with the just generated Alex.

simonmar · 2020-02-06T08:37:48Z

So what happens with the afterstartcodes token in Alex now? Does it never match the empty stream? Wouldn't that break the logic?

andreasabel · 2020-02-06T11:06:26Z

So what happens with the afterstartcodes token in Alex now? Does it never match the empty stream? Wouldn't that break the logic?

Yeah, that is what I am worried about, but I haven't reached full understanding yet.

andreasabel · 2020-02-11T13:24:06Z

It seems that my reasoning about nullable tokens is incorrect once we have startcodes that change upon receiving an empty token. Since this changes the state of the lexer, producing one empty token does not mean that from then on, only empty tokens will be produced.

Thus, it is also questionable whether a warning should be emitted. Maybe a warning could be emitted unless startcodes are involved? (However, the motivation to act any further on issue #71 has diminished with the observation that empty tokens do serve a role in the lexer.)

The question is also what the user should do about the warning. It could be ignored, but some like warning-freeness. Then, how to get rid of the warning, e.g. in case one uses a regex like

  a*b*c*d*

The most convenient way would be a regex-operator that removes the empty word. I have seen such an operator used in theory, but in practice, it does not seem standard.
Well, maybe the user that likes warning-freeness has to bite the bullet and manually unfold to the lengthy

a+b*c*d* | a*b+c*d* | a*b*c+d* | a*b*c*d+

So, the conclusion of my musings: What could be salvaged from this PR?

Mechanism to issue a warning on nullable tokens if no startcodes are used.
Typos in comments I fixed.

simonmar · 2020-02-13T08:27:52Z

I imagine the warning would only make sense if we can prove that the lexer will produce a continuous stream of empty tokens at some point, i.e. it won't terminate. It's not clear to me exactly what conditions are needed for this though.

andreasabel · 2020-02-13T20:54:07Z

Mechanism to issue a warning on nullable tokens if no startcodes are used.

This is what is implemented now. The warning is:

Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string.

I imagine the warning would only make sense if we can prove that the lexer will produce a continuous stream of empty tokens at some point, i.e. it won't terminate.

Indeed.

It's not clear to me exactly what conditions are needed for this though.

If the lexer does not use start codes, it state does not change when emitting a null token, since it does not advance the input. Thus, it will make the same decision and again emit a null token, and again and again...

@whitespaces

The parser for .x files will now generate warnings for rules whose regular expression is nullable, i.e., matches the empty input. Exceptions are: 1. The user wrote regular expression (). In this case they probably meant to match the empty input. 2. Startcodes are involved. Then, matching the empty input might still change the lexer state, thus, the lexer might not continue to match the empty input. Warnings are kept as a stack in the state of the parse monad, and printed upon successful completion of the parser in Main. Example (excerpt of `issue_71.x`): ``` $whitespace = [\ \n\t] @whitespaces = $whitespace* :- @whitespaces { \ _ _ -> Whitespaces } ^ Warning here! Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string. ``` Since the parser does not generate abstract syntax with position information, it is hard to give the exact warning location, i.e., the location of the offending regular expression. To keep the changes minimal, we record the location of the token _after_ the regular expression, i.e., the location where the code part begins. This approximate location should be good enough to jump to the regex in question. Another problem is that the exact text of the regular expression is not printed in the warning, only what `Show RExp` gives us. This could be fixed if we had the exact location information of the regular expression; we could then cut the regular expression out of the input string. Alternatively, the parser could be modified to return a _concrete_ syntax for the regular expression from which its original text could be recovered. The abstract reg.ex. would then be built from the concrete one in a second step. At this point, I abstain from such invasive changes to Alex for the sake of improving this rare warning.

simonmar · 2020-02-24T08:34:55Z

Thanks!

andreasabel requested a review from simonmar January 26, 2020 20:13

andreasabel force-pushed the issue71 branch from e916351 to c693ba7 Compare January 31, 2020 10:16

[ whitespace ] removed trailing whitespace

d04ca8b

andreasabel force-pushed the issue71 branch from c693ba7 to 414141f Compare February 13, 2020 20:46

andreasabel changed the title ~~[ fixed #71 ] do not produce tokens of length 0~~ [ #71 ] warn about nullable regexs in the absence of start codes Feb 13, 2020

andreasabel force-pushed the issue71 branch from 414141f to 564f22a Compare February 13, 2020 20:55

simonmar merged commit 60d5932 into haskell:master Feb 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ #71 ] warn about nullable regexs in the absence of start codes #155

[ #71 ] warn about nullable regexs in the absence of start codes #155

andreasabel commented Jan 26, 2020

andreasabel commented Jan 26, 2020

simonmar commented Jan 27, 2020

andreasabel commented Jan 30, 2020

simonmar commented Jan 30, 2020

andreasabel commented Jan 30, 2020

simonmar commented Jan 30, 2020

andreasabel commented Jan 30, 2020

andreasabel commented Jan 31, 2020 •

edited

Loading

andreasabel commented Jan 31, 2020 •

edited

Loading

andreasabel commented Feb 1, 2020

simonmar commented Feb 6, 2020

andreasabel commented Feb 6, 2020

andreasabel commented Feb 11, 2020

simonmar commented Feb 13, 2020

andreasabel commented Feb 13, 2020

simonmar commented Feb 24, 2020

[ #71 ] warn about nullable regexs in the absence of start codes #155

[ #71 ] warn about nullable regexs in the absence of start codes #155

Conversation

andreasabel commented Jan 26, 2020

andreasabel commented Jan 26, 2020

simonmar commented Jan 27, 2020

andreasabel commented Jan 30, 2020

simonmar commented Jan 30, 2020

andreasabel commented Jan 30, 2020

simonmar commented Jan 30, 2020

andreasabel commented Jan 30, 2020

andreasabel commented Jan 31, 2020 • edited Loading

andreasabel commented Jan 31, 2020 • edited Loading

andreasabel commented Feb 1, 2020

simonmar commented Feb 6, 2020

andreasabel commented Feb 6, 2020

andreasabel commented Feb 11, 2020

simonmar commented Feb 13, 2020

andreasabel commented Feb 13, 2020

simonmar commented Feb 24, 2020

andreasabel commented Jan 31, 2020 •

edited

Loading

andreasabel commented Jan 31, 2020 •

edited

Loading