Implement greedy name parsing #2
base: syntax-crate
Conversation
    self.end_node();
} else {
    // TODO: subtype_indication or expression?
    self.start_node(Internal(SubtypeIndicationOrExpressionTokens));
There may be a way to do the differentiation with limited lookahead, but I am a little afraid that it may become hard to reason about. That's something I can look into if you want.
If you want to keep it simple, we could try to do the parsing of the parenthesized name tails in a second pass using backtracking?
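For illustration, a minimal self-contained sketch of the backtracking idea (toy types; `MiniParser`, `checkpoint`, and `rewind` are made up for this sketch and are not the actual `Parser` API): remember the token position before attempting a production and rewind if it fails.

```rust
// Toy parser illustrating backtracking; not the real vhdl_syntax API.
struct MiniParser {
    tokens: Vec<String>,
    pos: usize,
}

impl MiniParser {
    fn checkpoint(&self) -> usize {
        self.pos
    }

    fn rewind(&mut self, checkpoint: usize) {
        self.pos = checkpoint;
    }

    /// Try `production`; if it fails, rewind to where we started so the
    /// caller can attempt the alternative interpretation.
    fn try_parse(&mut self, production: impl Fn(&mut Self) -> Result<(), ()>) -> bool {
        let cp = self.checkpoint();
        match production(self) {
            Ok(()) => true,
            Err(()) => {
                self.rewind(cp);
                false
            }
        }
    }
}

fn main() {
    let mut p = MiniParser {
        tokens: vec!["(".into(), "foo".into()],
        pos: 0,
    };
    // A production that consumes a token and then gives up: the position
    // is restored, so another interpretation can be tried.
    let ok = p.try_parse(|p| {
        p.pos += 1;
        Err(())
    });
    assert!(!ok);
    assert_eq!(p.pos, 0);
}
```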
I think I prefer the simple solution. Maybe the second pass should also be pushed to the analysis stage when more information about the prefix is known. This would mimic the LRM.
My only preference is that we can improve on this in the future without API-breaking changes.
Ok, should I get rid of the differentiation of what's inside the parentheses altogether and just generate a `RawTokens` node? Otherwise we will sometimes end up having an actual `SyntaxNode` while in other cases we have `RawTokens`. Not sure how useful that would be to a user.
Yes, I think that makes sense at this point.
}
// ...
    self.name_tail();
} else if is_start_of_attribute_name(self) {
Due to this little stunt, the `alias_declaration` you pushed yesterday works out of the box :)
Clever solution. Nice!
This looks really good! I feel like we're taking big steps in the right direction with this PR.
I have left some comments about the things you implemented, but most of it is up for discussion. Just let me know what you think.
pub fn library_clause(&mut self) {
    self.start_node(NodeKind::LibraryClause);
    self.expect_token(Keyword(Kw::Library));
Suggested change:
-self.expect_token(Keyword(Kw::Library));
+self.expect_kw(Kw::Library);

This can make the code a bit leaner, but naturally this is no big deal.
#[test]
fn parse_entity_with_context_clause() {
    let (design, _) = "\
This should probably be a `check` call.
let right_arrow_idx = match self.distance_to_closing_paren_or_token(Comma) {
    Some(length) => self.lookahead_max_distance(length, [RightArrow]),
    None => {
        self.eof_err();
I think that a pure EOF error should at some point be replaced by a more informative one (e.g., unbalanced parentheses, etc.). Also, I think that there should be more stop tokens, such as `semicolon`, `end`, the start of declarations and statements, etc., for fault-tolerant parsing.
But I'm more or less writing this down as a note so we don't forget it when we work on more error-tolerant parsing later. If you don't want to implement this right now, a todo comment or similar would be nice.
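Just as a reference for later, a minimal self-contained sketch of stop-token recovery (toy token kind; the real token kinds and recovery hooks will differ): instead of only reporting EOF, skip forward to the nearest token from a recovery set and resume parsing there.

```rust
// Toy token kind for illustration only; not the vhdl_syntax token type.
#[derive(Clone, Copy, PartialEq, Debug)]
enum TokKind {
    SemiColon,
    KwEnd,
    Other,
}

/// Return how many tokens to skip until a stop token is found,
/// or `None` if the input ends first (the pure-EOF case).
fn recover_to(tokens: &[TokKind], stop_tokens: &[TokKind]) -> Option<usize> {
    tokens.iter().position(|t| stop_tokens.contains(t))
}

fn main() {
    // A parse error occurred; resynchronize at `;` or `end`.
    let rest = [TokKind::Other, TokKind::Other, TokKind::SemiColon];
    assert_eq!(
        recover_to(&rest, &[TokKind::SemiColon, TokKind::KwEnd]),
        Some(2)
    );
}
```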
Tbh, I haven't put too much thought into error handling. This will eventually be an issue for all parser functions anyway, so I figured going for a half-baked solution right now isn't really worth it.
It seems like `association_element` has the same problem as `actual_part`: it can only be parsed if it is followed by a comma or a closing parenthesis. Would you be happy with a solution similar to the one in `range`? We can look ahead to the end of the `association_list` while parsing `function_call` (or `name` in our case) and pass that information down to `association_element` and `actual_part`.
In that case I would create `*_bounded` functions which take the `length` as a parameter. The public functions would simply pass `usize::MAX` to their `_bounded` variants.
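If I read this right, the shape would be roughly the following (a sketch only: `association_element_bounded` and its body are placeholders mirroring the `impl<T: TokenStream> Parser<T>` style visible in this diff, not an actual implementation):

```rust
impl<T: TokenStream> Parser<T> {
    /// Public entry point: no bound is known here, so the whole remaining
    /// input is fair game for lookahead.
    pub fn association_element(&mut self) {
        self.association_element_bounded(usize::MAX);
    }

    /// Bounded variant; `max_length` is the distance to the end of the
    /// enclosing association_list, as determined by the caller.
    pub(crate) fn association_element_bounded(&mut self, max_length: usize) {
        // ... use `max_length` to bound any lookahead ...
    }
}
```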
> Would you be happy with a similar solution like in `range`?

I think that would be better, yes. This sounds like a nice solution.
    self.end_node();
}

pub fn actual_part(&mut self) {
I'm not so sure about this implementation (see also the comment above). I'm afraid that this might hinder error-tolerant parsing at some point, because if a user forgets a parenthesis, the whole remainder of the file will fail to parse. Yet this is a somewhat valid use-case for things like completion, since a user might start typing `port map ( formal => /*missing actual and closing parenthesis*/`. In this example, no parsing and analysis of the rest of the file is possible.
But maybe the resolution is just the same as above: simply introduce more stop tokens.
Another thing that I have upheld so far is that all public functions parse valid productions, which is not true for this function, since it always requires a `Comma` or a closing paren at the end. In other words, `actual_part` should be able to parse `foo.bar(2 downto 0)`, yet this would yield an EOF error with this implementation.
Would you prefer an approach like the one for `range`? The caller of `actual_part` can look ahead to the end of it and pass its length down. Unlike `range`, `actual_part` would use this length to add the next `N` tokens to a `RawTokens` node.
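A rough sketch of that idea (hypothetical: `actual_part_bounded` and `skip_token` are assumed names; only `RawTokens` and the length-passing scheme come from this thread):

```rust
impl<T: TokenStream> Parser<T> {
    /// Hypothetical bounded variant: swallow the next `length` tokens into
    /// an opaque RawTokens node instead of trying to parse them.
    pub(crate) fn actual_part_bounded(&mut self, length: usize) {
        self.start_node(NodeKind::RawTokens);
        for _ in 0..length {
            self.skip_token(); // assumed helper that consumes one token verbatim
        }
        self.end_node();
    }
}
```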
match self.peek_token() {
    Some(CommAt) => {
        self.expect_token(CommAt);
        self.identifier();
        self.expect_token(Dot);
        self.identifier();
        self.expect_token(Dot);
        self.identifier();
        while self.opt_token(Dot) {
            self.identifier();
        }
    }
    Some(Dot) => {
        self.expect_token(Dot);
        self.partial_pathname();
    }
    Some(Circ | Identifier) => {
        while self.opt_token(Circ) {
            self.expect_token(Dot);
        }
        self.partial_pathname();
    }
    Some(_) => {
        self.expect_tokens_err([CommAt, Dot, Circ, Identifier]);
    }
    None => {
        self.eof_err();
    }
}
This (and probably some others) could be simplified by using the `match_next_token` or `match_next_token_consume` macro.
use crate::tokens::{Keyword as Kw, TokenKind::*, TokenStream};

impl<T: TokenStream> Parser<T> {
    pub fn range(&mut self, max_length: usize) {
I prefer this to be some internal function (i.e., `range_max_length` or so) and a secondary `range` function that passes `usize::MAX` to the internal function.
Can do that. Not sure how useful that will be, but at least it would be consistent.
You can also omit the public functions for now, but I would make the functions that are not "standard conform" `pub(crate)` for now. Making them public later is always easy; the inverse is not true.
// `max_length` should give the distance between the current token position and the end of the range to parse.
// This way the parser can use a bounded lookahead to distinguish between range expressions (using `to` or `downto`) and attribute names.
Really nice solution!
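To make that concrete, a self-contained toy version of the bounded lookahead (simplified token kind, not the actual implementation): within the next `max_length` tokens, seeing `to` or `downto` means we are looking at a range expression rather than an attribute name.

```rust
// Toy token kind for illustration only.
enum Tok {
    KwTo,
    KwDownto,
    Tick, // the `'` in e.g. `foo'range`
    Other,
}

/// Bounded lookahead: only the next `max_length` tokens are inspected, so a
/// `to`/`downto` belonging to an enclosing construct is never mistaken for
/// part of this range.
fn looks_like_range_expression(tokens: &[Tok], max_length: usize) -> bool {
    tokens
        .iter()
        .take(max_length)
        .any(|t| matches!(t, Tok::KwTo | Tok::KwDownto))
}

fn main() {
    // `foo'range` tail: no to/downto within the window.
    let attr = [Tok::Tick, Tok::Other];
    // `0 to 7`: `to` appears within the window.
    let rng = [Tok::Other, Tok::KwTo, Tok::Other];
    assert!(!looks_like_range_expression(&attr, attr.len()));
    assert!(looks_like_range_expression(&rng, rng.len()));
}
```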
vhdl_syntax/src/syntax/node_kind.rs (outdated)
ActualPartTokens,
SubtypeIndicationOrExpressionTokens,
I would probably just make this an abstract `Tokens` (or `RawTokens` or so). I think that the context can unambiguously be derived from where the tokens are stored.
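Sketched as an enum, the suggestion amounts to something like this (only the two existing variants and the `RawTokens` name come from this thread):

```rust
enum NodeKind {
    // ... other node kinds ...

    // Instead of one kind per context:
    //   ActualPartTokens,
    //   SubtypeIndicationOrExpressionTokens,
    // a single opaque kind; what the tokens mean follows from where the
    // node sits in the tree:
    RawTokens,
}
```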
Force-pushed from 4c60dbb to 566c862.
I think I got all your comments. To better support bounded lookaheads I introduced a `_bounded` variant for the affected parser functions. Calling the unbounded function simply passes `usize::MAX` down to its `_bounded` variant. Let me know what you think. (We can also move the […])
I implemented a first version of name parsing.

As already discussed through Matrix chat, this first implementation cannot fully differentiate some constructs and therefore clumps some tokens together in a generic `Internal(...Tokens)` node. This can be improved upon, but for simple names (especially selected names) it should work just fine. Also, I added some tests to show what the resulting structure may look like. The resulting `SyntaxNode` structure and naming is up for debate.

I implemented some (but definitely not all) of the name-related production rules, like `association_list`, `range` and `signature`.

The big TODO is parsing of `expression` and `subtype_indication`. If there is a way to differentiate between these two on the fly, we could maybe avoid doing a second pass to parse `Internal(...)` nodes.