Distinguish between unrecognized and missing input in the CST when recovering parse #835

Xanewok · 2024-02-19T13:43:31Z

Split from #640.

The rationale for this is we currently use SKIPPED for when we both recover past an unexpected stream of characters or we attempt to create a node despite some characters missing. In the latter case, we have SKIPPED node with contents "", so this might be misleading.

The text was updated successfully, but these errors were encountered:

Xanewok · 2024-05-23T16:25:29Z

@AntonyBlakey @OmarTawfik @ggiraldez let's have the discussion about the third tree variant here.

OmarTawfik · 2024-05-31T01:11:35Z

Mentioning an alternative that we discussed in later meetings, when working on parser error recovery/AST construction:

For characters that cannot be scanned into any terminal in the current lexical context, we create a regular Token with TerminalKind::Unrecognized. They will be added to the parent without a label, similar to what we do with trivia. Lexer can continue to scan for the same expected token afterwards.
For missing tokens that we expect to parse, for example Identifier, we create a regular Token with the expected kind TerminalKind::Identifier, and an empty string for contents. We add it to the parent along with the expected label as usual.

This has the following benefits:

The CST maintains its continuity, and we can unparse/reconstruct the input source by iterating over all tokens in-order as usual.
Created tokens can stay in the CST, there is no need to post-process it to remove any of them.
Using labels, external users (and internal APIs like AST construction) can find/select/query for all expected children correctly, skipping unrecognized tokens, similar to how they already skip trivia. Their queries don't need to have special cases for when an extra random character is inserted by mistake somewhere, or a bracket was deleted.
Both cases will additionally generate a parse error, attaching its location to the respective token location.
As opposed to adding more Node variants, this doesn't force us or the user to match on/handle the additional variants in 99.9999% of the cases where no such variants would exist on the tree. We also don't need to convert/lower to an extra type for correct input.

Possible downsides:

The "expected" tokens would have an empty string for contents, which we need to clearly document/explain its purpose. Although I don't think that is a big concern, given that the token will already have a parse error with this info/details on it.

Suggestions? Alternatives?

AntonyBlakey · 2024-06-02T22:26:28Z

99.9999%

Six 9's is a pretty strong assertion!

OmarTawfik · 2024-06-03T00:34:02Z

Six 9's is a pretty strong assertion!

This is based on the assumption that the main use case for error nodes would be hardhat-vscode for the foreseeable future, and that most static code analysis we anticipate is performed on pre-validated contracts.

Xanewok · 2024-06-05T08:56:07Z

To circle back a bit and try to crystallize the underlying problem a bit, I feel like the main issue that you'd like to tackle @OmarTawfik (in the way that's different from what we currently do in main branch) is specifically the "missing" kind.

The proposed above TerminalKind::Unrecognized and this PR's Node::Unrecognized is the same as having the current TerminalKind::SKIPPED, which containing the skipped text, so I think we all agree on the solution there for this case.

However, I don't think it's possible to synthesize an empty terminal node with guessed kind for missing input in the general case as we could reasonably expect a set of multiple tokens that would allow us to progress rather than just a single one.

If I understand correctly or not missing anything:

We would have to pick a guessed parse for the user from the language, which would be surprising, but also
there is no way to reasonably guess and synthesize missing input - imagine the cases where we're recovering from a missing Expression. That's also a hard problem and we'd be writing a fuzz tester/syntax autocompletion at that point.

There are some cases where we only expect a single terminal (e.g. ;) in which case we could use the strategy above but since it's not possible in the general case, I'd prefer not to do this to not have two ways to represent holes in the CST.

If the key motivation is to discern between the "unrecognized" and the "missing" errors, then we could probably introduce another TerminalKind::HoleOrMissingOrBikesheddable? This way, we keep using the side-channel for richer errors like expected set of terminals that would allow us to progress (rather than this PR's Node::Missing(kinds)) but we'd be a bit more explicit about the kind of recovery that was triggered (holes vs skipping unrecognized stuff).

Xanewok · 2024-06-05T09:08:38Z

The original solution from the PR title is not what we agree on, so I changed the title to reflect what the underlying problem seems to be; feel free to update it if we can phrase it better.

OmarTawfik · 2024-06-05T11:55:06Z

The proposed above TerminalKind::Unrecognized and this PR's Node::Unrecognized is the same as having the current TerminalKind::SKIPPED, which containing the skipped text, so I think we all agree on the solution there for this case.

Then the only suggestion here then is to rename it to Unrecognized, because SKIPPED doesn't explain why it was skipped. Unrecognized does exactly this 👍

there is no way to reasonably guess and synthesize missing input - imagine the cases where we're recovering from a missing Expression. That's also a hard problem and we'd be writing a fuzz tester/syntax autocompletion at that point.

I think there are ways to easily get around this, by having DSL authors annotate Enum definitions (or places where the parser executes an ordered choice) with a variant to recover to, which will lead to a specific "expectation". But I agree, this is out of scope for this issue, and requires a larger discussion/design than what we want to do here.

I'd prefer not to do this to not have two ways to represent holes in the CST.

Agreed, but IIUC, there are two approaches here, since it is not possible to always come up with an exact "expected" TerminalKind:

Not produce anything at all in this case: Just keep the tree intact.
Produce TerminalKind::HoleOrMissingOrBikesheddable for holes.

I wonder if it is possible to do the first option? mainly because then it is just noise that doesn't provide additional value to the user, and may even be approximate/incorrect (for the above reasons you mentioned), and a TerminalKind::Hole basically means "this actually should have been a different existing TerminalKind variant, but we don't know/can't tell you what it is".

Xanewok · 2024-06-05T13:02:18Z

To be clear, producing holes would not be approximate or incorrect assuming we don't synthesize "valid" tokens - it's the approach that we already use on main and in this PR via Node::Missing. Also as a counter-point to the above, this would break the symmetry of emitting "unrecognized" vs "missing" - since we already have to contain unrecognized error node for tree completeness, not including the other type of error would have to be justified.

I'm fine with keeping a side-channel for the richer errors but I'm not 100% sold yet on the idea of completely eliding that information from the tree.

To answer the question whether a tree is valid, any downstream consumer would have to then carry around the equivalent of ParseOutput rather than just the tree itself. At the same time it feels like they would not be interested into which terminals were expected at some location (richer data) but just the node validity.

While I'm typing this response, I can also envision a system where the downstream consumers also can track that information in a separate side-table/query system, so I guess it depends on what our architecture will look like...

I'm sure @AntonyBlakey will have some opinions and preference as well.

However, it definitely feels that we lean towards not having a third Node variant and also renaming the SKIPPED to Unrecognized for now, assuming we can easily switch error recovery to use this side-channel to query node completeness rather than the in-tree error nodes.

AntonyBlakey · 2024-06-05T14:58:48Z

Side-channeling is easy because we can index by utf-8 position and provide a nice API that allows arbitrary typed attachments per node. If the side-channel is sorted then correlation is always log(n). This is ECS.

OmarTawfik · 2024-06-07T08:10:17Z

To answer the question whether a tree is valid, any downstream consumer would have to then carry around the equivalent of ParseOutput rather than just the tree itself. At the same time it feels like they would not be interested into which terminals were expected at some location (richer data) but just the node validity.

Fair point. In that case, what do you think about using a token with TerminalKind::Missing and an empty string? This would solve both of our concerns above.

Also, would it be possible for these nodes to deterministically have the "correct" original EdgeLabel attached to it? at least that would give users a hint of "what" is missing. Otherwise, it would have to have a None label.

assuming we can easily switch error recovery to use this side-channel to query node completeness rather than the in-tree error nodes.

My understanding is that we want to track this information via a field on the ParserResult instead, and propagate that field upwards when unwrapping/wrapping new ParseResult instances. This is opposite of the current approach of checking the sub-tree recursively on every step, because of the large perf cost that we pay even for correct/valid input, not just during error recovery. Is that still the case?

Xanewok · 2024-06-07T13:51:27Z

Fair point. In that case, what do you think about using a token with TerminalKind::Missing and an empty string?

This looks like my proposed solution of TerminalKind::HoleOrMissingOrBikesheddable, so I'm obviously fine with it 😅 I don't feel so strongly about the name; Missing sounds good.

Also, would it be possible for these nodes to deterministically have the "correct" original EdgeLabel attached to it?

Unless I'm missing something, we run to the exact same problem as with trying to reconstruct the nodes with "guessed" terminal kind - we need to guess what the user meant to write across multiple options and we're back at the problem of trying to complete the tree ourselves.

My understanding is that we want to track this information via a field on the ParserResult instead,

We didn't really explore much or settle on a specific solution here yet, just agreed that it might be a problem perf-wise in pathological cases but also agreed to postpone it until we have a way to measure performance, first. I'm happy to revisit once we have a way to measure improvements and I have a bit more time to work on Slang 👍

Is that still the case?

Yes, at the moment we still check the structure for validity/evaluate it recursively when doing error recovery and attempting a best recovery for a choice.

OmarTawfik · 2024-06-10T12:12:36Z

This looks like my proposed solution of TerminalKind::HoleOrMissingOrBikesheddable, so I'm obviously fine with it 😅

Wonderful 😅 No bikeshedding needed if @AntonyBlakey agrees then!

we run to the exact same problem as with trying to reconstruct the nodes with "guessed" terminal kind

IIUC, the terminal issue is because we have multiple expectations, for example when we try to parse an EnumItem, and we are not sure which expectation to populate the tree with. But from the parent/calling parser POV, the error happens at a single node with a single label?

Not a blocker of course. We can always revisit this later, especially if this proves to be technically more work than it is worth.

Yes, at the moment we still check the structure for validity/evaluate it recursively when doing error recovery and attempting a best recovery for a choice.

I think we are also doing it for valid input, not just error recovery. For example Match::is_full_recursive(), which will happen for any ChoiceHelper even at root of the tree (like SourceUnitMember). I created #1005 for this so that we can track it.

I'm happy to revisit once we have a way to measure improvements and I have a bit more time to work on Slang

Sounds great 👍 Thank you!

Xanewok · 2024-06-16T19:21:39Z

I opened #1013 with the outlined fix.

However, when I was in the process of writing this code, I think again that just having one invalid kind seems better overall:

only one to worry about, the two are equivalent/someone might forget about handling the other
the MISSING is basically just UNRECOGNIZED with explicit check that the node is empty (it makes the runtime code slightly more complex without substantial enough benefit)
it now allows representing invalid/harder to reason about state, i.e. UNRECOGNIZED with empty contents or MISSING with non-empty contents

If the original concern here is that it's confusing to have SKIPPED/UNRECOGNIZED with empty contents, I wonder if the simplest solution might actually be the best one, which is renaming it to InvalidOrMissing (casing is irrelevant). This solves the problems listed above and it's pretty clear: if the contents are empty then it's missing, otherwise it's invalid. Actually discerning them gives us no value, here.

@AntonyBlakey @OmarTawfik what do you think?

Alternative to #969 Closes #835 Closes #507 Closes #700 This implements the idea from #835: - introduces a new `TerminalKind::MISSING` - renames `TerminalKind::SKIPPED` to `TerminalKind::UNRECOGNIZED` - emits `TerminalKind::MISSING` instead of `TerminalKind::UNRECOGNIZED` when the tree is empty When writing this, I came to a conclusion that actually using two distinc terminal kinds might do more harm than good here, see #835 (comment).

Xanewok self-assigned this Feb 19, 2024

Xanewok added this to Slang - 2024 H1 Feb 19, 2024

github-project-automation bot moved this to ⏳ Todo in Slang - 2024 H1 Feb 19, 2024

Xanewok mentioned this issue Feb 19, 2024

public API changes #640

Closed

7 tasks

This was referenced May 17, 2024

Parser returns an empty SKIPPED token for empty input #507

Closed

prevent empty SKIPPED nodes in the tree #700

Closed

Xanewok moved this from ⏳ Todo to 🔄 In Progress in Slang - 2024 H1 May 20, 2024

Xanewok mentioned this issue May 20, 2024

[blocked] Add a dedicated cst::Node::Invalid for incomplete/invalid CST nodes #969

Closed

Xanewok changed the title ~~Add dedicated CST Unrecognized and Missing variants when recovering parse~~ Distinguish between unrecognized and missing input in the CST when recovering parse Jun 5, 2024

Xanewok mentioned this issue Jun 16, 2024

Discern invalid and missing syntax with different terminal kinds #1013

Merged

Xanewok closed this as completed in #1013 Jun 20, 2024

github-project-automation bot moved this from 🔄 In Progress to ✅ Done in Slang - 2024 H1 Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distinguish between unrecognized and missing input in the CST when recovering parse #835

Distinguish between unrecognized and missing input in the CST when recovering parse #835

Xanewok commented Feb 19, 2024 •

edited

Loading

Xanewok commented May 23, 2024

OmarTawfik commented May 31, 2024 •

edited

Loading

AntonyBlakey commented Jun 2, 2024

OmarTawfik commented Jun 3, 2024

Xanewok commented Jun 5, 2024 •

edited

Loading

Xanewok commented Jun 5, 2024

OmarTawfik commented Jun 5, 2024

Xanewok commented Jun 5, 2024

AntonyBlakey commented Jun 5, 2024

OmarTawfik commented Jun 7, 2024

Xanewok commented Jun 7, 2024

OmarTawfik commented Jun 10, 2024 •

edited

Loading

Xanewok commented Jun 16, 2024

Distinguish between unrecognized and missing input in the CST when recovering parse #835

Distinguish between unrecognized and missing input in the CST when recovering parse #835

Comments

Xanewok commented Feb 19, 2024 • edited Loading

Xanewok commented May 23, 2024

OmarTawfik commented May 31, 2024 • edited Loading

AntonyBlakey commented Jun 2, 2024

OmarTawfik commented Jun 3, 2024

Xanewok commented Jun 5, 2024 • edited Loading

Xanewok commented Jun 5, 2024

OmarTawfik commented Jun 5, 2024

Xanewok commented Jun 5, 2024

AntonyBlakey commented Jun 5, 2024

OmarTawfik commented Jun 7, 2024

Xanewok commented Jun 7, 2024

OmarTawfik commented Jun 10, 2024 • edited Loading

Xanewok commented Jun 16, 2024

Xanewok commented Feb 19, 2024 •

edited

Loading

OmarTawfik commented May 31, 2024 •

edited

Loading

Xanewok commented Jun 5, 2024 •

edited

Loading

OmarTawfik commented Jun 10, 2024 •

edited

Loading