Distinguish between unrecognized and missing input in the CST when recovering parse #835
Comments
@AntonyBlakey @OmarTawfik @ggiraldez let's have the discussion about the third tree variant here.
Mentioning an alternative that we discussed in later meetings, when working on parser error recovery/AST construction:
This has the following benefits:
Possible downsides:
Suggestions? Alternatives?
Six 9's is a pretty strong assertion!
This is based on the assumption that the main use case for error nodes would be
To circle back and try to crystallize the underlying problem a bit, I feel like the main issue that you'd like to tackle @OmarTawfik (in the way that's different from what we currently do in The proposed above However, I don't think it's possible to synthesize an empty terminal node with a guessed kind for missing input in the general case, as we could reasonably expect a set of multiple tokens that would allow us to progress rather than just a single one. If I understand correctly and am not missing anything:
There are some cases where we only expect a single terminal (e.g. If the key motivation is to discern between the "unrecognized" and the "missing" errors, then we could probably introduce another |
`Unrecognized` and `Missing` variants when recovering parse
The original solution from the PR title is not what we agreed on, so I changed the title to reflect what the underlying problem seems to be; feel free to update it if we can phrase it better.
Then the only suggestion here is to rename it to
I think there are ways to easily get around this, by having DSL authors annotate
Agreed, but IIUC, there are two approaches here, since it is not possible to always come up with an exact "expected"
I wonder if it is possible to do the first option? Mainly because then it is just noise that doesn't provide additional value to the user, and may even be approximate/incorrect (for the above reasons you mentioned), and a
To be clear, producing holes would not be approximate or incorrect assuming we don't synthesize "valid" tokens - it's the approach that we already use on I'm fine with keeping a side-channel for the richer errors but I'm not 100% sold yet on the idea of completely eliding that information from the tree. To answer the question whether a tree is valid, any downstream consumer would have to then carry around the equivalent of While I'm typing this response, I can also envision a system where the downstream consumers also can track that information in a separate side-table/query system, so I guess it depends on what our architecture will look like... I'm sure @AntonyBlakey will have some opinions and preference as well. However, it definitely feels that we lean towards not having a third Node variant and also renaming the |
Side-channeling is easy because we can index by UTF-8 position and provide a nice API that allows arbitrary typed attachments per node. If the side-channel is sorted, then correlation is always O(log n). This is ECS.
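For illustration, a minimal sketch of what such a sorted side-channel could look like. Everything here (`SideTable`, `Attachment`, `insert`, `at`, `in_range`) is a hypothetical name made up for this example, not Slang's API; it only assumes that attachments are keyed by a UTF-8 byte offset and kept sorted so that lookups are binary searches.

```rust
/// Hypothetical side-table entry: a typed payload attached to a UTF-8 byte offset.
#[derive(Debug, Clone, PartialEq)]
pub struct Attachment<T> {
    pub utf8_offset: usize,
    pub payload: T,
}

/// Hypothetical side table, kept sorted by `utf8_offset`.
#[derive(Debug)]
pub struct SideTable<T> {
    entries: Vec<Attachment<T>>,
}

impl<T> SideTable<T> {
    pub fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Inserts while preserving sort order; entries with equal offsets keep
    /// their insertion order. The position is found in O(log n), although the
    /// `Vec` insert itself still shifts elements.
    pub fn insert(&mut self, utf8_offset: usize, payload: T) {
        let idx = self
            .entries
            .partition_point(|e| e.utf8_offset <= utf8_offset);
        self.entries.insert(idx, Attachment { utf8_offset, payload });
    }

    /// All attachments at exactly `offset` (useful for zero-width nodes).
    pub fn at(&self, offset: usize) -> &[Attachment<T>] {
        let lo = self.entries.partition_point(|e| e.utf8_offset < offset);
        let hi = self.entries.partition_point(|e| e.utf8_offset <= offset);
        &self.entries[lo..hi]
    }

    /// All attachments whose offset falls in the half-open range `[start, end)`.
    pub fn in_range(&self, start: usize, end: usize) -> &[Attachment<T>] {
        if start >= end {
            return &[];
        }
        let lo = self.entries.partition_point(|e| e.utf8_offset < start);
        let hi = self.entries.partition_point(|e| e.utf8_offset < end);
        &self.entries[lo..hi]
    }
}
```

A consumer would correlate a node with its attachments by calling `in_range` with the node's byte range, or `at` for a zero-width node such as one produced for missing input.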
Fair point. In that case, what do you think about using a token with Also, would it be possible for these nodes to deterministically have the "correct" original
My understanding is that we want to track this information via a field on the |
This looks like my proposed solution of
Unless I'm missing something, we run into the exact same problem as with trying to reconstruct the nodes with a "guessed" terminal kind - we need to guess what the user meant to write across multiple options and we're back at the problem of trying to complete the tree ourselves.
We didn't really explore much or settle on a specific solution here yet, just agreed that it might be a problem perf-wise in pathological cases but also agreed to postpone it until we have a way to measure performance, first. I'm happy to revisit once we have a way to measure improvements and I have a bit more time to work on Slang 👍
Yes, at the moment we still check the structure for validity/evaluate it recursively when doing error recovery and attempting a best recovery for a choice. |
Wonderful 😅 No bikeshedding needed if @AntonyBlakey agrees then!
IIUC, the terminal issue is because we have multiple expectations, for example when we try to parse an Not a blocker of course. We can always revisit this later, especially if this proves to be technically more work than it is worth.
I think we are also doing it for valid input, not just error recovery. For example
Sounds great 👍 Thank you!
I opened #1013 with the outlined fix. However, when I was in the process of writing this code, I think again that just having one invalid kind seems better overall:
If the original concern here is that it's confusing to have @AntonyBlakey @OmarTawfik what do you think? |
Alternative to #969
Closes #835
Closes #507
Closes #700

This implements the idea from #835:
- introduces a new `TerminalKind::MISSING`
- renames `TerminalKind::SKIPPED` to `TerminalKind::UNRECOGNIZED`
- emits `TerminalKind::MISSING` instead of `TerminalKind::UNRECOGNIZED` when the tree is empty

When writing this, I came to the conclusion that using two distinct terminal kinds might actually do more harm than good here, see #835 (comment).
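As a rough sketch of the distinction described above (not the code from the PR): the enum below uses Rust-style variant names standing in for `TerminalKind::UNRECOGNIZED` and `TerminalKind::MISSING`, and the `Terminal` struct, `recovered_terminal`, and `is_valid` are hypothetical helpers invented for this example. It assumes "the tree is empty" means the recovered terminal has no text.

```rust
/// Simplified stand-in for the terminal kinds discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TerminalKind {
    /// Placeholder for any ordinary, valid terminal kind.
    Identifier,
    /// Input that was present but skipped during recovery (previously SKIPPED).
    Unrecognized,
    /// Input that was expected but absent, so the terminal is empty.
    Missing,
}

/// Hypothetical terminal node: a kind plus the source text it covers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Terminal {
    pub kind: TerminalKind,
    pub text: String,
}

/// Builds the error terminal for a recovery step. Assumption: empty text
/// means nothing was skipped, so the input is missing rather than unrecognized.
pub fn recovered_terminal(skipped_text: &str) -> Terminal {
    let kind = if skipped_text.is_empty() {
        TerminalKind::Missing
    } else {
        TerminalKind::Unrecognized
    };
    Terminal {
        kind,
        text: skipped_text.to_string(),
    }
}

/// A downstream consumer now has to check both error kinds to decide validity.
pub fn is_valid(terminal: &Terminal) -> bool {
    !matches!(
        terminal.kind,
        TerminalKind::Missing | TerminalKind::Unrecognized
    )
}
```

Note that `is_valid` has to match on two kinds instead of one, which is the kind of extra burden on consumers that a single invalid kind would avoid.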
Split from #640.
The rationale for this is that we currently use `SKIPPED` both when we recover past an unexpected stream of characters and when we attempt to create a node despite some characters missing. In the latter case, we have a `SKIPPED` node with contents `""`, so this might be misleading.