-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ #71 ] warn about nullable regexs in the absence of start codes #155
Conversation
If you define a token that can have length zero, returning an infinite sequence of those at the end of the input is correct, no? Shouldn't we just emit a warning and do what the user asked for? |
Since Alex rejected nullable tokens before by simply crashing, this is the first time that nullable tokens are considered, and we are free to define their semantics. (Also, there is no need to rush the issue.) My proposal is that all the tokens Alex ever produces will match at least 1 character of the input. I think this the intuitive expectation simply because the alternative, returning tokens that do not advance the input is useless. Unlike in parsers, where some non-terminals match the empty input, there is no way to request from the lexer "I would need this null token now, can you deliver it?" The lexer works in an autonomous way, giving the token that matches the longest piece of input; thus, it will only return a null token if nothing else is possible, ie, if it is stuck. And then, infinitely many null tokens. I cannot imagine a situation where this is useful behavior, can you? On the contrary, allowing Alex to return null tokens destroys the following properties:
Additionally to these conceptual arguments, one could make some field studies of what other lexers produce in this situation. I'd be surprised if any lexer out there produces an infinite stream of nullable tokens, and my initial experiments using BNFC generated parsers do not suggest so. The rule (ocamllex, flex, JLex) seems to give parse errors, ANTLR is the outlier with delivering an imho incorrect positive result (a token containing the string ""). But to be absolutely sure, I would have to tighten my tests to involve lexers only. |
I'm fine with requiring all tokens to match at least one byte/character, but then we should make it a static error for there to be nullable tokens, rather than just ignoring them. |
For consideration, the non-nullable version of
is
which is a linear blowup. Insisting on non-nullability of the RE rather than just to actual tokens produced that match the RE could place an unnecessary burden on the user. I observed ANTLR warning about nullable REs, I haven't observed any lexer generator giving an outright error. |
That makes sense. Can we give a warning then? |
Yes, I think giving warnings is the appropriate thing. |
[ #71 ] Emit warnings for nullable tokens. The parser for .x files will now generate warnings for rules whose regular expression is nullable, i.e., matches the empty input. Alex does not emit tokens that consume no input, thus, a warning is appropriate. Warnings are kept as a stack in the state of the parse monad, and printed upon successful completion of the parser in Main. Example (excerpt of
Since the parser does not generate abstract syntax with position information, it is hard to give the exact warning location, i.e., the location of the offending regular expression. To keep the changes minimal, we record the location of the token after the regular expression, i.e., the location where the code part begins. This approximate location should be good enough to jump to the regex in question. Another problem is that the exact text of the regular expression is not printed in the warning, only what At this point, I abstain from such invasive changes to Alex for the sake of improving this rare warning. Bikeshedding on the warning text welcome:
|
I just saw the empty token in Alex' own lexer: UPDATE: I think I panicked for no reason. Ignore my comment. I did not understand what |
Please do not merge yet. First, I want to make sure travis also tests bootstrapping with the just generated Alex. |
So what happens with the |
Yeah, that is what I am worried about, but I haven't reached full understanding yet. |
It seems that my reasoning about nullable tokens is incorrect once we have startcodes that change upon receiving an empty token. Since this changes the state of the lexer, producing one empty token does not mean that from then on, only empty tokens will be produced. Thus, it is also questionable whether a warning should be emitted. Maybe a warning could be emitted unless startcodes are involved? (However, the motivation to act any further on issue #71 has diminished with the observation that empty tokens do serve a role in the lexer.) The question is also what the user should do about the warning. It could be ignored, but some like warning-freeness. Then, how to get rid of the warning, e.g. in case one uses a regex like
The most convenient way would be a regex-operator that removes the empty word. I have seen such an operator used in theory, but in practice, it does not seem standard.
So, the conclusion of my musings: What could be salvaged from this PR?
|
I imagine the warning would only make sense if we can prove that the lexer will produce a continuous stream of empty tokens at some point, i.e. it won't terminate. It's not clear to me exactly what conditions are needed for this though. |
This is what is implemented now. The warning is:
Indeed.
If the lexer does not use start codes, it state does not change when emitting a null token, since it does not advance the input. Thus, it will make the same decision and again emit a null token, and again and again... |
The parser for .x files will now generate warnings for rules whose regular expression is nullable, i.e., matches the empty input. Exceptions are: 1. The user wrote regular expression (). In this case they probably meant to match the empty input. 2. Startcodes are involved. Then, matching the empty input might still change the lexer state, thus, the lexer might not continue to match the empty input. Warnings are kept as a stack in the state of the parse monad, and printed upon successful completion of the parser in Main. Example (excerpt of `issue_71.x`): ``` $whitespace = [\ \n\t] @whitespaces = $whitespace* :- @whitespaces { \ _ _ -> Whitespaces } ^ Warning here! Warning: issue_71.x:24:14: Regular expression [..]* matches the empty string. ``` Since the parser does not generate abstract syntax with position information, it is hard to give the exact warning location, i.e., the location of the offending regular expression. To keep the changes minimal, we record the location of the token _after_ the regular expression, i.e., the location where the code part begins. This approximate location should be good enough to jump to the regex in question. Another problem is that the exact text of the regular expression is not printed in the warning, only what `Show RExp` gives us. This could be fixed if we had the exact location information of the regular expression; we could then cut the regular expression out of the input string. Alternatively, the parser could be modified to return a _concrete_ syntax for the regular expression from which its original text could be recovered. The abstract reg.ex. would then be built from the concrete one in a second step. At this point, I abstain from such invasive changes to Alex for the sake of improving this rare warning.
Thanks! |
Changed
alex_scan_tkn
to returnAlexNone
instead ofAlexLastAcc
orAlexLastSkip
if length of newly processed input is still 0.This fixes the problem that Alex would produce an infinite sequence of
tokens of length 0 when the DFA was stuck in a state accepting a
nullable token.
Now it will either produce EOF (if at end of input) or a lexer error (if
not at the end of input). This is the intuitively expected behavior.