
Implement greedy name parsing #2

Open · skaupper wants to merge 2 commits into base: syntax-crate

Conversation

skaupper

I implemented a first version of name parsing.

As already discussed in the Matrix chat, this first implementation cannot fully differentiate some constructs and therefore clumps some tokens together into a generic Internal(...Tokens) node. This can be improved upon, but for simple names (especially selected names) it should work just fine.

Also, I added some tests to show what the resulting structure may look like. The resulting SyntaxNode structure and naming are up for debate.

I implemented some (but definitely not all) of the name-related production rules, like association_list, range and signature.

The big TODO is parsing of expression and subtype_indication. If there is a way to differentiate between the two on the fly, we could maybe avoid a second pass to parse the Internal(...) nodes.
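For context, an illustration of the ambiguity (foo and bar are placeholder names, not taken from the PR):

// Each input below admits more than one reading: bar may be an
// expression (indexed name or function-call argument) or a subtype
// indication / discrete range (slice, type conversion). Which reading
// is correct depends on what bar resolves to, and that is only known
// during semantic analysis.
const AMBIGUOUS_TAILS: [&str; 2] = ["foo(bar)", "foo(bar(1))"];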

    self.end_node();
} else {
    // TODO: subtype_indication or expression?
    self.start_node(Internal(SubtypeIndicationOrExpressionTokens));
Author

There may be a way to do the differentiation with limited lookahead, but I am a little afraid that it may become hard to reason about. That's something I can look into if you want.

If you want to keep it simple, we could instead try parsing the parenthesized name tails in a second pass, using backtracking.

Owner

I think I prefer the simple solution. Maybe the second pass should even be pushed to the analysis stage, when more information about the prefix is known; this would mimic the LRM.
My only requirement is that we can improve on this in the future without API-breaking changes.

Author

OK, should I get rid of the differentiation of what's inside the parentheses altogether and just generate a RawTokens node? Otherwise we will sometimes end up with an actual SyntaxNode while in other cases we have RawTokens; I'm not sure how useful that would be to a user.

Owner

Yes, I think that makes sense at this point.

    }

    self.name_tail();
} else if is_start_of_attribute_name(self) {
Author

Due to this little stunt, the alias_declaration you pushed yesterday works out of the box :)

Owner

Clever solution. Nice!

Owner @Schottkyc137 left a comment

This looks really good! I feel like we're taking big steps in the right direction with this PR.
I have left some comments about what I think of some of the things you implemented, but most of it is up for discussion. Just let me know what you think.

vhdl_syntax/src/parser/productions/design.rs (resolved)

pub fn library_clause(&mut self) {
    self.start_node(NodeKind::LibraryClause);
    self.expect_token(Keyword(Kw::Library));
Owner

Suggested change:
-    self.expect_token(Keyword(Kw::Library));
+    self.expect_kw(Kw::Library);

This can make the code a bit leaner, but naturally this is no big deal
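For reference, such a helper could be a thin wrapper over the existing call; a sketch inferred from the two call forms in the suggestion, so the actual signature in the crate may differ:

impl<T: TokenStream> Parser<T> {
    // Convenience wrapper so call sites don't repeat Keyword(...) for
    // every keyword they expect.
    pub fn expect_kw(&mut self, kw: Kw) {
        self.expect_token(Keyword(kw));
    }
}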


#[test]
fn parse_entity_with_context_clause() {
    let (design, _) = "\
Owner

This should probably be a check call

let right_arrow_idx = match self.distance_to_closing_paren_or_token(Comma) {
    Some(length) => self.lookahead_max_distance(length, [RightArrow]),
    None => {
        self.eof_err();
Owner

I think that a pure EOF error should at some point be replaced by a more informative one (e.g., unbalanced parentheses, etc.). Also, I think there should be more stop tokens, such as the semicolon, end, and the starts of declarations and statements, for fault-tolerant parsing.
But I'm mostly writing this down as a note so we don't forget it when we get to more error-tolerant parsing later. If you don't want to implement this right now, a TODO comment or similar would be nice.
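One possible shape for such a stop-token check (purely illustrative: SemiColon and the End/Begin keywords are assumed names; only Comma, RightArrow and RightPar are visible in this PR):

fn is_stop_token(kind: TokenKind) -> bool {
    // Hypothetical recovery set: rather than scanning to EOF, stop at
    // tokens that plausibly terminate the enclosing construct and report
    // a targeted error (e.g. unbalanced parentheses) there.
    matches!(kind, SemiColon | RightPar | Keyword(Kw::End) | Keyword(Kw::Begin))
}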

Author

To be honest, I haven't put much thought into error handling yet. It will eventually be an issue for all parser functions anyway, so I figured going for a half-baked solution right now isn't really worth it.

Author

It seems association_element has the same problem as actual_part: it can only be parsed if it is followed by a comma or a closing parenthesis. Would you be happy with a solution similar to the one in range? We could look ahead to the end of the association_list while parsing function_call (or name, in our case) and pass that information down to association_element and actual_part.

In that case I would create *_bounded functions which take the length as a parameter. The public functions would simply pass usize::MAX to their _bounded variants.

Owner

"Would you be happy with a solution similar to the one in range?"

I think that would be better, yes. This sounds like a nice solution.
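For illustration, the split described above might look roughly like this (a sketch only: the *_bounded naming comes from the comment above, lookahead_max_distance appears elsewhere in this diff, and the bodies are placeholders):

impl<T: TokenStream> Parser<T> {
    // Public entry point: no outer bound is known, so the element may
    // span the rest of the token stream.
    pub fn association_element(&mut self) {
        self.association_element_bounded(usize::MAX);
    }

    // Bounded variant for callers that have already measured the distance
    // to the next Comma or closing parenthesis of the association_list.
    pub(crate) fn association_element_bounded(&mut self, max_length: usize) {
        // A bounded lookahead such as
        //     self.lookahead_max_distance(max_length, [RightArrow])
        // can then decide whether a formal part ("formal =>") is present.
    }
}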

    self.end_node();
}

pub fn actual_part(&mut self) {
Owner

I'm not so sure about this implementation (see also the comment above). I'm afraid it might hinder error-tolerant parsing at some point, because if a user forgets a parenthesis, the whole remainder of the file will fail to parse. Yet this is a somewhat valid use case for features like completion, since a user might start typing port map ( formal => /*missing actual and closing parenthesis*/. In this example, no parsing or analysis of the rest of the file is possible.
But maybe the resolution is simply the same as above: more stop tokens should be introduced.

Another invariant I have upheld so far is that all public functions parse valid productions, which is not true for this function since it always requires a Comma or closing parenthesis at the end. In other words, actual_part should be able to parse foo.bar(2 downto 0), yet with this implementation it would yield an EOF error.

Author

Would you prefer an approach like the one for range? The caller of actual_part can look ahead to the end of it and pass its length down. Unlike range, actual_part would use this length to add the next N tokens to a RawTokens node.
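A minimal sketch of that variant (the skip helper is hypothetical; how tokens are consumed in the real crate may differ):

pub(crate) fn actual_part_bounded(&mut self, max_length: usize) {
    // The caller has already measured how many tokens belong to this
    // actual part, so wrap exactly that many in a raw-token node and
    // defer their interpretation to a later pass.
    self.start_node(Internal(RawTokens));
    for _ in 0..max_length {
        self.skip(); // hypothetical: consume one token unconditionally
    }
    self.end_node();
}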


Comment on lines 201 to 229
match self.peek_token() {
    Some(CommAt) => {
        // Package pathname: @library.package.object(.object)*
        self.expect_token(CommAt);
        self.identifier();
        self.expect_token(Dot);
        self.identifier();
        self.expect_token(Dot);
        self.identifier();
        while self.opt_token(Dot) {
            self.identifier();
        }
    }
    Some(Dot) => {
        // Absolute pathname: .partial_pathname
        self.expect_token(Dot);
        self.partial_pathname();
    }
    Some(Circ | Identifier) => {
        // Relative pathname: (^.)* partial_pathname
        while self.opt_token(Circ) {
            self.expect_token(Dot);
        }
        self.partial_pathname();
    }
    Some(_) => {
        self.expect_tokens_err([CommAt, Dot, Circ, Identifier]);
    }
    None => {
        self.eof_err();
    }
}
Owner

This (and probably some others) could be simplified by using the match_next_token or match_next_token_consume macro.

use crate::tokens::{Keyword as Kw, TokenKind::*, TokenStream};

impl<T: TokenStream> Parser<T> {
    pub fn range(&mut self, max_length: usize) {
Owner @Schottkyc137, Jan 25, 2025

I prefer this to be some internal function (i.e., range_max_length or so) and a secondary range function that passes usize::MAX to the internal function.

Author

Can do that. Not sure how useful that will be, but at least it would be consistent.

Owner

You can also omit the public functions for now; the functions that are not "standard-conforming" I would make pub(crate) for the time being. Making them public later is always easy. The inverse is not true.

Comment on lines 15 to 16
// `max_length` should give the distance between the current token position and the end of the range to parse.
// This way the parser can use a bounded lookahead to distinguish between range expressions (using `to` or `downto`) and attribute names.
Owner

Really nice solution!

Comment on lines 9 to 10
ActualPartTokens,
SubtypeIndicationOrExpressionTokens,
Owner

I would probably just make this an abstract Tokens (or RawTokens or so). I think that the context can unambiguously be derived from where the tokens are stored.
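That is, the two variants above would collapse into a single catch-all; a sketch (the enum name here is made up):

enum InternalNodeKind {
    // One generic variant instead of ActualPartTokens and
    // SubtypeIndicationOrExpressionTokens: the kind of the parent node
    // already says what the raw tokens stand for.
    RawTokens,
}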

Author @skaupper

I think I got all your comments. To better support bounded lookaheads, I introduced a token_index field to the parser.

Calling the unbounded Parser::actual_part now expects the whole token stream to be part of the production, which is not the case if it is followed by a token like RightPar. But I guess that is to be expected when you call a normally bounded function with usize::MAX 🤷

Let me know what you think.


(We could also move the token_index field to the NodeBuilder, but other than saving a few self.token_index += 1; statements I don't see the benefit.)
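For illustration, the bookkeeping described here might look roughly like this (only the token_index field and the += 1 bookkeeping are taken from the comments above; the rest is assumed):

struct Parser<T: TokenStream> {
    tokens: T,
    // Absolute position in the token stream, advanced on every consumed
    // token (the repeated self.token_index += 1;). Bounds such as
    // max_length are then distances from this index.
    token_index: usize,
}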
