
Implement greedy name parsing #2

Open · skaupper wants to merge 2 commits into base: syntax-crate

Conversation

skaupper

I implemented a first version of name parsing.

As already discussed in the Matrix chat, this first implementation cannot fully differentiate some constructs and therefore clumps some tokens together into a generic Internal(...Tokens) node. This can be improved upon, but for simple names (especially selected names) it should work just fine.

Also, I added some tests to show what the resulting structure may look like. The resulting SyntaxNode structure and naming are up for debate.

I implemented some (but definitely not all) of the name-related production rules, like association_list, range and signature.

The big TODO is parsing of expression and subtype_indication. If there is a way to differentiate between the two on the fly, we could maybe avoid a second pass to parse the Internal(...) nodes.
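For context, an illustration of the ambiguity (foo and bar are placeholder names, not taken from the PR):

// Each input below admits more than one reading: bar may be an
// expression (indexed name or function-call argument) or a subtype
// indication / discrete range (slice, type conversion). Which reading
// is correct depends on what bar resolves to, and that is only known
// during semantic analysis.
const AMBIGUOUS_TAILS: [&str; 2] = ["foo(bar)", "foo(bar(1))"];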

    self.end_node();
} else {
    // TODO: subtype_indication or expression?
    self.start_node(Internal(SubtypeIndicationOrExpressionTokens));
Author

There may be a way to do the differentiation with limited lookahead, but I am a little afraid that it may become hard to reason about. That's something I can look into if you want.

If you want to keep it simple, we could instead try parsing the parenthesized name tails in a second pass, using backtracking.

Owner

I think I prefer the simple solution. Maybe the second pass should even be pushed to the analysis stage, when more information about the prefix is known; this would mimic the LRM.
My only requirement is that we can improve on this in the future without API-breaking changes.

Author

OK, should I get rid of the differentiation of what's inside the parentheses altogether and just generate a RawTokens node? Otherwise we will sometimes end up with an actual SyntaxNode while in other cases we have RawTokens; I'm not sure how useful that would be to a user.

Owner

Yes, I think that makes sense at this point.

    }

    self.name_tail();
} else if is_start_of_attribute_name(self) {
Author

Due to this little stunt, the alias_declaration you pushed yesterday works out of the box :)

Owner

Clever solution. Nice!

Owner @Schottkyc137 left a comment

This looks really good! I feel like we're taking big steps in the right direction with this PR.
I have left some comments about what I think of some of the things you implemented, but most of it is up for discussion. Just let me know what you think.

vhdl_syntax/src/parser/productions/design.rs (resolved)

pub fn library_clause(&mut self) {
    self.start_node(NodeKind::LibraryClause);
    self.expect_token(Keyword(Kw::Library));
Owner

Suggested change:
-    self.expect_token(Keyword(Kw::Library));
+    self.expect_kw(Kw::Library);

This can make the code a bit leaner, but naturally this is no big deal
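For reference, such a helper could be a thin wrapper over the existing call; a sketch inferred from the two call forms in the suggestion, so the actual signature in the crate may differ:

impl<T: TokenStream> Parser<T> {
    // Convenience wrapper so call sites don't repeat Keyword(...) for
    // every keyword they expect.
    pub fn expect_kw(&mut self, kw: Kw) {
        self.expect_token(Keyword(kw));
    }
}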


#[test]
fn parse_entity_with_context_clause() {
    let (design, _) = "\
Owner

This should probably be a check call

let right_arrow_idx = match self.distance_to_closing_paren_or_token(Comma) {
    Some(length) => self.lookahead_max_distance(length, [RightArrow]),
    None => {
        self.eof_err();
Owner

I think that a pure EOF error should at some point be replaced by a more informative one (e.g., unbalanced parentheses, etc.). Also, I think there should be more stop tokens, such as the semicolon, end, and the starts of declarations and statements, for fault-tolerant parsing.
But I'm mostly writing this down as a note so we don't forget it when we get to more error-tolerant parsing later. If you don't want to implement this right now, a TODO comment or similar would be nice.
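One possible shape for such a stop-token check (purely illustrative: SemiColon and the End/Begin keywords are assumed names; only Comma, RightArrow and RightPar are visible in this PR):

fn is_stop_token(kind: TokenKind) -> bool {
    // Hypothetical recovery set: rather than scanning to EOF, stop at
    // tokens that plausibly terminate the enclosing construct and report
    // a targeted error (e.g. unbalanced parentheses) there.
    matches!(kind, SemiColon | RightPar | Keyword(Kw::End) | Keyword(Kw::Begin))
}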

Author

To be honest, I haven't put much thought into error handling yet. It will eventually be an issue for all parser functions anyway, so I figured going for a half-baked solution right now isn't really worth it.

Author

It seems association_element has the same problem as actual_part: it can only be parsed if it is followed by a comma or a closing parenthesis. Would you be happy with a solution similar to the one in range? We could look ahead to the end of the association_list while parsing function_call (or name, in our case) and pass that information down to association_element and actual_part.

In that case I would create *_bounded functions which take the length as a parameter. The public functions would simply pass usize::MAX to their _bounded variants.

Owner

"Would you be happy with a solution similar to the one in range?"

I think that would be better, yes. This sounds like a nice solution.
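For illustration, the split described above might look roughly like this (a sketch only: the *_bounded naming comes from the comment above, lookahead_max_distance appears elsewhere in this diff, and the bodies are placeholders):

impl<T: TokenStream> Parser<T> {
    // Public entry point: no outer bound is known, so the element may
    // span the rest of the token stream.
    pub fn association_element(&mut self) {
        self.association_element_bounded(usize::MAX);
    }

    // Bounded variant for callers that have already measured the distance
    // to the next Comma or closing parenthesis of the association_list.
    pub(crate) fn association_element_bounded(&mut self, max_length: usize) {
        // A bounded lookahead such as
        //     self.lookahead_max_distance(max_length, [RightArrow])
        // can then decide whether a formal part ("formal =>") is present.
    }
}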

    self.end_node();
}

pub fn actual_part(&mut self) {
Owner

I'm not so sure about this implementation (see also the comment above). I'm afraid it might hinder error-tolerant parsing at some point, because if a user forgets a parenthesis, the whole remainder of the file will fail to parse. Yet this is a somewhat valid use case for features like completion, since a user might start typing port map ( formal => /*missing actual and closing parenthesis*/. In this example, no parsing or analysis of the rest of the file is possible.
But maybe the resolution is simply the same as above: more stop tokens should be introduced.

Another invariant I have upheld so far is that all public functions parse valid productions, which is not true for this function since it always requires a Comma or closing parenthesis at the end. In other words, actual_part should be able to parse foo.bar(2 downto 0), yet with this implementation it would yield an EOF error.

Author

Would you prefer an approach like the one for range? The caller of actual_part can look ahead to the end of it and pass its length down. Unlike range, actual_part would use this length to add the next N tokens to a RawTokens node.
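A minimal sketch of that variant (the skip helper is hypothetical; how tokens are consumed in the real crate may differ):

pub(crate) fn actual_part_bounded(&mut self, max_length: usize) {
    // The caller has already measured how many tokens belong to this
    // actual part, so wrap exactly that many in a raw-token node and
    // defer their interpretation to a later pass.
    self.start_node(Internal(RawTokens));
    for _ in 0..max_length {
        self.skip(); // hypothetical: consume one token unconditionally
    }
    self.end_node();
}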


Comment on lines 201 to 229
match self.peek_token() {
    Some(CommAt) => {
        // Package pathname: @library.package.object(.object)*
        self.expect_token(CommAt);
        self.identifier();
        self.expect_token(Dot);
        self.identifier();
        self.expect_token(Dot);
        self.identifier();
        while self.opt_token(Dot) {
            self.identifier();
        }
    }
    Some(Dot) => {
        // Absolute pathname: .partial_pathname
        self.expect_token(Dot);
        self.partial_pathname();
    }
    Some(Circ | Identifier) => {
        // Relative pathname: (^.)* partial_pathname
        while self.opt_token(Circ) {
            self.expect_token(Dot);
        }
        self.partial_pathname();
    }
    Some(_) => {
        self.expect_tokens_err([CommAt, Dot, Circ, Identifier]);
    }
    None => {
        self.eof_err();
    }
}
Owner

This (and probably some others) could be simplified by using the match_next_token or match_next_token_consume macro.

use crate::tokens::{Keyword as Kw, TokenKind::*, TokenStream};

impl<T: TokenStream> Parser<T> {
    pub fn range(&mut self, max_length: usize) {
Owner @Schottkyc137, Jan 25, 2025

I prefer this to be some internal function (i.e., range_max_length or so) and a secondary range function that passes usize::MAX to the internal function.

Author

Can do that. Not sure how useful that will be, but at least it would be consistent.

Owner

You can also omit the public functions for now; the functions that are not "standard-conforming" I would make pub(crate) for the time being. Making them public later is always easy. The inverse is not true.

Comment on lines 15 to 16
// `max_length` should give the distance between the current token position and the end of the range to parse.
// This way the parser can use a bounded lookahead to distinguish between range expressions (using `to` or `downto`) and attribute names.
Owner

Really nice solution!

Comment on lines 9 to 10
ActualPartTokens,
SubtypeIndicationOrExpressionTokens,
Owner

I would probably just make this an abstract Tokens (or RawTokens or so). I think that the context can unambiguously be derived from where the tokens are stored.
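That is, the two variants above would collapse into a single catch-all; a sketch (the enum name here is made up):

enum InternalNodeKind {
    // One generic variant instead of ActualPartTokens and
    // SubtypeIndicationOrExpressionTokens: the kind of the parent node
    // already says what the raw tokens stand for.
    RawTokens,
}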

Author @skaupper

I think I got all your comments. To better support bounded lookaheads, I introduced a token_index field to the parser.

Calling the unbounded Parser::actual_part now expects the whole token stream to be part of the production, which is not the case if it is followed by a token like RightPar. But I guess that is to be expected when you call a normally bounded function with usize::MAX 🤷

Let me know what you think.


(We could also move the token_index field to the NodeBuilder, but other than saving a few self.token_index += 1; statements I don't see the benefit.)
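For illustration, the bookkeeping described here might look roughly like this (only the token_index field and the += 1 bookkeeping are taken from the comments above; the rest is assumed):

struct Parser<T: TokenStream> {
    tokens: T,
    // Absolute position in the token stream, advanced on every consumed
    // token (the repeated self.token_index += 1;). Bounds such as
    // max_length are then distances from this index.
    token_index: usize,
}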
