Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish between unrecognized and missing input in the CST when recovering parse #835

Closed
Xanewok opened this issue Feb 19, 2024 · 13 comments · Fixed by #1013
Closed

Distinguish between unrecognized and missing input in the CST when recovering parse #835

Xanewok opened this issue Feb 19, 2024 · 13 comments · Fixed by #1013
Assignees

Comments

@Xanewok
Copy link
Contributor

Xanewok commented Feb 19, 2024

Split from #640.

The rationale for this is we currently use SKIPPED for when we both recover past an unexpected stream of characters or we attempt to create a node despite some characters missing. In the latter case, we have SKIPPED node with contents "", so this might be misleading.

@Xanewok
Copy link
Contributor Author

Xanewok commented May 23, 2024

@AntonyBlakey @OmarTawfik @ggiraldez let's have the discussion about the third tree variant here.

@OmarTawfik
Copy link
Contributor

OmarTawfik commented May 31, 2024

Mentioning an alternative that we discussed in later meetings, when working on parser error recovery/AST construction:

  • For characters that cannot be scanned into any terminal in the current lexical context, we create a regular Token with TerminalKind::Unrecognized. They will be added to the parent without a label, similar to what we do with trivia. Lexer can continue to scan for the same expected token afterwards.
  • For missing tokens that we expect to parse, for example Identifier, we create a regular Token with the expected kind TerminalKind::Identifier, and an empty string for contents. We add it to the parent along with the expected label as usual.

This has the following benefits:

  • The CST maintains its continuity, and we can unparse/reconstruct the input source by iterating over all tokens in-order as usual.
  • Created tokens can stay in the CST, there is no need to post-process it to remove any of them.
  • Using labels, external users (and internal APIs like AST construction) can find/select/query for all expected children correctly, skipping unrecognized tokens, similar to how they already skip trivia. Their queries don't need to have special cases for when an extra random character is inserted by mistake somewhere, or a bracket was deleted.
  • Both cases will additionally generate a parse error, attaching its location to the respective token location.
  • As opposed to adding more Node variants, this doesn't force us or the user to match on/handle the additional variants in 99.9999% of the cases where no such variants would exist on the tree. We also don't need to convert/lower to an extra type for correct input.

Possible downsides:

  • The "expected" tokens would have an empty string for contents, which we need to clearly document/explain its purpose. Although I don't think that is a big concern, given that the token will already have a parse error with this info/details on it.

Suggestions? Alternatives?

@AntonyBlakey
Copy link
Contributor

99.9999%

Six 9's is a pretty strong assertion!

@OmarTawfik
Copy link
Contributor

Six 9's is a pretty strong assertion!

This is based on the assumption that the main use case for error nodes would be hardhat-vscode for the foreseeable future, and that most static code analysis we anticipate is performed on pre-validated contracts.

@Xanewok
Copy link
Contributor Author

Xanewok commented Jun 5, 2024

To circle back a bit and try to crystallize the underlying problem a bit, I feel like the main issue that you'd like to tackle @OmarTawfik (in the way that's different from what we currently do in main branch) is specifically the "missing" kind.

The proposed above TerminalKind::Unrecognized and this PR's Node::Unrecognized is the same as having the current TerminalKind::SKIPPED, which containing the skipped text, so I think we all agree on the solution there for this case.

However, I don't think it's possible to synthesize an empty terminal node with guessed kind for missing input in the general case as we could reasonably expect a set of multiple tokens that would allow us to progress rather than just a single one.

If I understand correctly or not missing anything:

  1. We would have to pick a guessed parse for the user from the language, which would be surprising, but also
  2. there is no way to reasonably guess and synthesize missing input - imagine the cases where we're recovering from a missing Expression. That's also a hard problem and we'd be writing a fuzz tester/syntax autocompletion at that point.

There are some cases where we only expect a single terminal (e.g. ;) in which case we could use the strategy above but since it's not possible in the general case, I'd prefer not to do this to not have two ways to represent holes in the CST.

If the key motivation is to discern between the "unrecognized" and the "missing" errors, then we could probably introduce another TerminalKind::HoleOrMissingOrBikesheddable? This way, we keep using the side-channel for richer errors like expected set of terminals that would allow us to progress (rather than this PR's Node::Missing(kinds)) but we'd be a bit more explicit about the kind of recovery that was triggered (holes vs skipping unrecognized stuff).

@Xanewok Xanewok changed the title Add dedicated CST Unrecognized and Missing variants when recovering parse Distinguish between unrecognized and missing input in the CST when recovering parse Jun 5, 2024
@Xanewok
Copy link
Contributor Author

Xanewok commented Jun 5, 2024

The original solution from the PR title is not what we agree on, so I changed the title to reflect what the underlying problem seems to be; feel free to update it if we can phrase it better.

@OmarTawfik
Copy link
Contributor

The proposed above TerminalKind::Unrecognized and this PR's Node::Unrecognized is the same as having the current TerminalKind::SKIPPED, which containing the skipped text, so I think we all agree on the solution there for this case.

Then the only suggestion here then is to rename it to Unrecognized, because SKIPPED doesn't explain why it was skipped. Unrecognized does exactly this 👍

there is no way to reasonably guess and synthesize missing input - imagine the cases where we're recovering from a missing Expression. That's also a hard problem and we'd be writing a fuzz tester/syntax autocompletion at that point.

I think there are ways to easily get around this, by having DSL authors annotate Enum definitions (or places where the parser executes an ordered choice) with a variant to recover to, which will lead to a specific "expectation". But I agree, this is out of scope for this issue, and requires a larger discussion/design than what we want to do here.

I'd prefer not to do this to not have two ways to represent holes in the CST.

Agreed, but IIUC, there are two approaches here, since it is not possible to always come up with an exact "expected" TerminalKind:

  1. Not produce anything at all in this case: Just keep the tree intact.
  2. Produce TerminalKind::HoleOrMissingOrBikesheddable for holes.

I wonder if it is possible to do the first option? mainly because then it is just noise that doesn't provide additional value to the user, and may even be approximate/incorrect (for the above reasons you mentioned), and a TerminalKind::Hole basically means "this actually should have been a different existing TerminalKind variant, but we don't know/can't tell you what it is".

@Xanewok
Copy link
Contributor Author

Xanewok commented Jun 5, 2024

To be clear, producing holes would not be approximate or incorrect assuming we don't synthesize "valid" tokens - it's the approach that we already use on main and in this PR via Node::Missing. Also as a counter-point to the above, this would break the symmetry of emitting "unrecognized" vs "missing" - since we already have to contain unrecognized error node for tree completeness, not including the other type of error would have to be justified.

I'm fine with keeping a side-channel for the richer errors but I'm not 100% sold yet on the idea of completely eliding that information from the tree.

To answer the question whether a tree is valid, any downstream consumer would have to then carry around the equivalent of ParseOutput rather than just the tree itself. At the same time it feels like they would not be interested into which terminals were expected at some location (richer data) but just the node validity.

While I'm typing this response, I can also envision a system where the downstream consumers also can track that information in a separate side-table/query system, so I guess it depends on what our architecture will look like...

I'm sure @AntonyBlakey will have some opinions and preference as well.


However, it definitely feels that we lean towards not having a third Node variant and also renaming the SKIPPED to Unrecognized for now, assuming we can easily switch error recovery to use this side-channel to query node completeness rather than the in-tree error nodes.

@AntonyBlakey
Copy link
Contributor

Side-channeling is easy because we can index by utf-8 position and provide a nice API that allows arbitrary typed attachments per node. If the side-channel is sorted then correlation is always log(n). This is ECS.

@OmarTawfik
Copy link
Contributor

To answer the question whether a tree is valid, any downstream consumer would have to then carry around the equivalent of ParseOutput rather than just the tree itself. At the same time it feels like they would not be interested into which terminals were expected at some location (richer data) but just the node validity.

Fair point. In that case, what do you think about using a token with TerminalKind::Missing and an empty string? This would solve both of our concerns above.

Also, would it be possible for these nodes to deterministically have the "correct" original EdgeLabel attached to it? at least that would give users a hint of "what" is missing. Otherwise, it would have to have a None label.

assuming we can easily switch error recovery to use this side-channel to query node completeness rather than the in-tree error nodes.

My understanding is that we want to track this information via a field on the ParserResult instead, and propagate that field upwards when unwrapping/wrapping new ParseResult instances. This is opposite of the current approach of checking the sub-tree recursively on every step, because of the large perf cost that we pay even for correct/valid input, not just during error recovery. Is that still the case?

@Xanewok
Copy link
Contributor Author

Xanewok commented Jun 7, 2024

Fair point. In that case, what do you think about using a token with TerminalKind::Missing and an empty string?

This looks like my proposed solution of TerminalKind::HoleOrMissingOrBikesheddable, so I'm obviously fine with it 😅 I don't feel so strongly about the name; Missing sounds good.

Also, would it be possible for these nodes to deterministically have the "correct" original EdgeLabel attached to it?

Unless I'm missing something, we run to the exact same problem as with trying to reconstruct the nodes with "guessed" terminal kind - we need to guess what the user meant to write across multiple options and we're back at the problem of trying to complete the tree ourselves.

My understanding is that we want to track this information via a field on the ParserResult instead,

We didn't really explore much or settle on a specific solution here yet, just agreed that it might be a problem perf-wise in pathological cases but also agreed to postpone it until we have a way to measure performance, first. I'm happy to revisit once we have a way to measure improvements and I have a bit more time to work on Slang 👍

Is that still the case?

Yes, at the moment we still check the structure for validity/evaluate it recursively when doing error recovery and attempting a best recovery for a choice.

@OmarTawfik
Copy link
Contributor

OmarTawfik commented Jun 10, 2024

This looks like my proposed solution of TerminalKind::HoleOrMissingOrBikesheddable, so I'm obviously fine with it 😅

Wonderful 😅 No bikeshedding needed if @AntonyBlakey agrees then!

we run to the exact same problem as with trying to reconstruct the nodes with "guessed" terminal kind

IIUC, the terminal issue is because we have multiple expectations, for example when we try to parse an EnumItem, and we are not sure which expectation to populate the tree with. But from the parent/calling parser POV, the error happens at a single node with a single label?

Not a blocker of course. We can always revisit this later, especially if this proves to be technically more work than it is worth.

Yes, at the moment we still check the structure for validity/evaluate it recursively when doing error recovery and attempting a best recovery for a choice.

I think we are also doing it for valid input, not just error recovery. For example Match::is_full_recursive(), which will happen for any ChoiceHelper even at root of the tree (like SourceUnitMember). I created #1005 for this so that we can track it.

I'm happy to revisit once we have a way to measure improvements and I have a bit more time to work on Slang

Sounds great 👍 Thank you!

@Xanewok
Copy link
Contributor Author

Xanewok commented Jun 16, 2024

I opened #1013 with the outlined fix.

However, when I was in the process of writing this code, I think again that just having one invalid kind seems better overall:

  • only one to worry about, the two are equivalent/someone might forget about handling the other
  • the MISSING is basically just UNRECOGNIZED with explicit check that the node is empty (it makes the runtime code slightly more complex without substantial enough benefit)
  • it now allows representing invalid/harder to reason about state, i.e. UNRECOGNIZED with empty contents or MISSING with non-empty contents

If the original concern here is that it's confusing to have SKIPPED/UNRECOGNIZED with empty contents, I wonder if the simplest solution might actually be the best one, which is renaming it to InvalidOrMissing (casing is irrelevant). This solves the problems listed above and it's pretty clear: if the contents are empty then it's missing, otherwise it's invalid. Actually discerning them gives us no value, here.

@AntonyBlakey @OmarTawfik what do you think?

github-merge-queue bot pushed a commit that referenced this issue Jun 20, 2024
Alternative to #969

Closes #835
Closes #507
Closes #700

This implements the idea from #835:
- introduces a new `TerminalKind::MISSING`
- renames `TerminalKind::SKIPPED` to `TerminalKind::UNRECOGNIZED`
- emits `TerminalKind::MISSING` instead of `TerminalKind::UNRECOGNIZED`
when the tree is empty

When writing this, I came to a conclusion that actually using two
distinc terminal kinds might do more harm than good here, see
#835 (comment).
@github-project-automation github-project-automation bot moved this from 🔄 In Progress to ✅ Done in Slang - 2024 H1 Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: ✅ Done
3 participants