Error recovery #2038

PieterOlivier · 2024-10-04T07:34:10Z

This is a feature-tracking PR; it will remain a draft while we work on the error recovery feature. Other, smaller PRs will target this branch, and when they are ready we can merge this one. Where possible we do most of the reviewing in the open PRs that target the error-recovery branch, but sometimes things are missed there, and then it is better to comment on changes in this global tracking PR.

History:

In 2012 @jurgenvinju wrote an initial implementation of parser recovery that was based on user-supplied recovery hints.
In 2022 @jurgenvinju resurrected that recovery attempt and got it into decent shape, now without user-supplied recovery hints, but lacked the bandwidth to finish it.
In 2024 @PieterOlivier picked this back up as part of swat.engineering's effort to invest in some larger-scale Rascal infrastructure improvements.
Implement basic error recovery #2020: the first PR that was reviewed and merged into this feature branch.
Test support for error recovery #2034: testing of error recovery on examples in multiple languages and intensive testing on Rascal source files.
Reimplemented error tree disambiguation in Java #2035
Error recovery #2033: an old tracking branch (we had to rename it to add CI support for branches targeting this PR).
Moved error recovery functions from ParseTree to util::ErrorRecovery #2042
Reimplemented findAllErrors function in Java #2043: implementation of findAllErrors in Java, needed for rascal-lsp.
Support for parsing input that forms a valid prefix #2053
Trigger recovery when only recovery stacks are left #2057
Fixed size inconsistency between allocating and enlarging #2062
Added '\n' end matcher so end-of-line is also seen as skip terminator #2064
Feat/remove auto disambiguation #2075
TODO: more PRs to follow

Sometimes recovery nodes start before the current location where the parser failed to continue. Since the parser works with a short queue of scheduled TODOs around the current cursor, we might end up outside this queue when recovering. This breaks several unspecified invariants of the SGTDBF implementation. For now I added a check that detects when a recovery node would be planned before the currently retained history, and filters that recovery node out. The next step will be to make backtracking over the current location possible.
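The detection described above can be sketched roughly as follows. This is only an illustration: `RecoveryWindowFilter`, `RecoveryNode`, and `oldestRetainedOffset` are hypothetical names that do not come from the parser sources, and the real check may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual parser code. */
final class RecoveryWindowFilter {
    /** Hypothetical stand-in for a recovery node; only its start offset matters here. */
    static final class RecoveryNode {
        final String name;
        final int startOffset;

        RecoveryNode(String name, int startOffset) {
            this.name = name;
            this.startOffset = startOffset;
        }
    }

    /**
     * Keeps only the recovery nodes that start at or after the oldest input
     * offset still covered by the retained queue (an assumed notion).
     * Nodes that start earlier are dropped, preserving the order of the rest.
     */
    static List<RecoveryNode> keepSchedulable(List<RecoveryNode> candidates, int oldestRetainedOffset) {
        List<RecoveryNode> kept = new ArrayList<>();
        for (RecoveryNode node : candidates) {
            if (node.startOffset >= oldestRetainedOffset) {
                kept.add(node);
            }
        }
        return kept;
    }
}
```

Dropping such nodes loses some recovery opportunities, which is why backtracking over the current location is named as the next step.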

jurgenvinju · 2024-11-08T14:53:28Z

src/org/rascalmpl/values/RascalFunctionValueFactory.java

+            if (!allowAmbiguity && allowRecovery && filters.isEmpty()) {
+                // Filter error-induced ambiguities
+                RascalValueFactory valueFactory = (RascalValueFactory) ValueFactoryFactory.getValueFactory();
+                parseForest = (ITree) new ErrorRecovery(valueFactory).disambiguateErrors(parseForest, valueFactory.bool(false));


I think this disambiguation step is smart in most cases, but it should be optional and on-by-default. When we do repair or autocomplete of parse errors, some of the filtered ambiguities could have been insightful for the user. A comparable algorithm would rank the alternatives, for example by the number of skipped tokens; there are many different options, and the choice should be left to the language designer.

What happens if we skip this step here and leave it to the language designer in their parse function, or their services?
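As an illustration of the ranking alternative mentioned above, here is a minimal sketch that orders alternatives by skipped-token count without discarding any of them. `SkippedTokenRanking` and `Alternative` are hypothetical types for this example only; the real error trees are Rascal parse trees, and how the number of skipped tokens would be obtained from them is left open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; not part of the PR. */
final class SkippedTokenRanking {
    /** Hypothetical stand-in for one alternative in an error-induced ambiguity cluster. */
    static final class Alternative {
        final String label;
        final int skippedTokens;

        Alternative(String label, int skippedTokens) {
            this.label = label;
            this.skippedTokens = skippedTokens;
        }
    }

    /**
     * Returns a new list ordered by ascending skipped-token count.
     * Nothing is filtered out; ties keep their original relative order
     * because List.sort is stable.
     */
    static List<Alternative> rank(List<Alternative> alternatives) {
        List<Alternative> ranked = new ArrayList<>(alternatives);
        ranked.sort(Comparator.comparingInt((Alternative a) -> a.skippedTokens));
        return ranked;
    }
}
```

Because the result is a ranking rather than a filter, a downstream tool can still inspect every alternative and apply its own cut-off.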

If you set allowAmbiguity=true, a language designer already gets the opportunity to filter the trees themselves. So it's already optional and on-by-default.

I feel the allowAmbiguity flag is meant to indicate something different. This option now conflates error recovery for a user with developer ambiguity in a grammar.

It's important for usability (performance) to still be able to exit tree construction quickly on unexpected ambiguity, and have predictable error recovery that represents all currently valid prefixes.

I don't know what the right solution is yet, but I think that allowAmbiguity=false should be ignored after error recovery.

To make this less subtle-sounding:

A normal ambiguity is almost never intended, confuses the downstream tools as well as the user, and possibly leads to high-degree polynomial tree construction time (very slow). Typically we disallow ambiguity after grammar deployment because of this, while we need ambiguity to debug grammars at development time.

An error ambiguity is almost always inherent to the recovery process. The clusters represent the different viable prefixes at the moment the parser discovered it was stuck. The user needs all of them for useful features like auto-complete and quickfix. Disallowing those after deployment will severely hamper the usability of error recovery. And this is also flipped: typically we don't need error recovery at grammar development time, because we need to fix the errors ourselves in the grammar, while it is essential to turn it on after deployment for usability.

In other words, we do expect error ambiguity but we don't expect grammatical ambiguity. So it can't be the same option.

It's good to know, as a kind of illustration, that error recovery for a single typo will produce ambiguous clusters of predictions on different levels, for different nonterminals in the tree. At the same time, different prefixes could be active on the same line and for the same nonterminals. As a result, one recovered file {w,c,sh}ould receive quickfix proposals at different line/column positions. But if we have already filtered heuristically because "ambiguity is not allowed", many of the valid and natural options will have been removed.

We could pass in an external error filtering function similar to the actions, if that is required for speed, but otherwise I think it would be best to not filter at this stage and leave the ranking to quickfix and autocomplete.

These are compelling arguments. So we should remove the automatic error tree disambiguation altogether and leave it to the developer. A developer can always specify a filter to do the disambiguation.

Currently the disambiguation function takes two arguments (the tree and a boolean specifying whether "normal" ambiguities are allowed). Maybe we should provide two separate functions that only take a tree argument, so the programmer can just use one of them directly as a parse filter?
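One way to picture the proposed split is partial application: fix the boolean once and expose the result as a single-argument function. The sketch below is generic and hypothetical (`ErrorFilters` and `fixAllowAmbiguity` are invented names); it says nothing about how the actual Rascal or Java functions would be named or wired into the parser.

```java
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

/** Hypothetical sketch; not the API proposed in the PR. */
final class ErrorFilters {
    /**
     * Turns a two-argument disambiguation function (tree, allowAmbiguity)
     * into a single-argument filter by fixing the boolean up front.
     */
    static <T> UnaryOperator<T> fixAllowAmbiguity(BiFunction<T, Boolean, T> disambiguate, boolean allowAmbiguity) {
        return tree -> disambiguate.apply(tree, allowAmbiguity);
    }
}
```

Calling `fixAllowAmbiguity(f, true)` and `fixAllowAmbiguity(f, false)` would then yield the two tree-only filters, each of which could be passed wherever a one-argument filter is expected.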

Yes. We can have examples of reusable functions and filters in Recover.rsc

I agree that the ambiguity parameter interaction was too simplistic. But I want to add a few points for consideration.

There will be regular users and power users of this functionality. I think we should recognize that, and make sure it is also usable for people who do not want to walk a whole forest just to get syntax highlighting and maybe an outline. I think having a keyword parameter that says something like keepOnlyShortestErrorTree, or a shorter name, would help make this more accessible.

I think that, with the exception of features like auto-complete, use of the whole forest is quite niche. There are quite a few places where we will run into error trees and do not care which of the trees in the forest is the right one; we just want to skip over that part and do something useful for the rest.

I want to be able to run the parser outside the evaluator, also when there is an error tree filter. This is useful for rascal-lsp, both for DSLs (where everything is inside the evaluator) and for Rascal itself. Since error recovery will run even more frequently, and might sometimes spend a bit more time recovering, I would love it if we could keep the current performance feature of running it on a regular Java thread without having to lock the evaluator. Passing it a post-parse filter function will (I think) now require a lock on the evaluator.

Thanks for the discussion!

It's exactly the normal users who have to be protected against heuristic filters. The semantics of such filters is unpredictable, while if we build them in they become a contract for the users' downstream tooling. The shortest error is truly nothing more than a blunt heuristic which misfires more often than not. Especially if we want to provide grammar-derived repair for those simpler users, the shortest error makes little sense; I'm pretty sure it is almost never the "right" one.

We've been here before in the 2000s with "powerful" disambiguation filters. Removing them because of their semantic inaccuracy was a socio-economic wasteland. People were not happy that their trees "suddenly" became ambiguous, while they were fully ignorant of what they couldn't be aware of: the trees they actually should care about. Hiding ambiguity is a minefield that I want to steer clear of.

That's not true. The forest represents all the viable prefixes at the time of getting stuck. In general, none of them is more likely than the others to be the intended future representation after repair. It's even worse: many of the predictions may already have failed and been garbage collected, making the forest incomplete in that sense too. More incompleteness will only make this worse. Currently we have a declarative and complete contract for the output of error recovery; with any auto-filter the contract becomes ill-defined from a language semantics perspective.

Error recovery and repair or diagnostics algorithms do not belong in the parser contribution, ever. The requirements of the downstream processor dictate how to deal with errors or multiple ambiguous errors. A repair algorithm takes each prefix to complete, a type checker may want to skip entire expressions and statements that contain any errors at all, a diagnostics algorithm may fuzz around the error positions and reparse, etc. Different filters for different backends. Filtering errors is not a syntax aspect, it's a semantics aspect.

In one LSP server, different services will deal differently with the errors and their clusters. The first priority should be to make sure that algorithms written by beginners either fail gracefully or stay accidentally robust in the presence of errors. Features like visit and dynamic dispatch already deal gracefully with ambiguity and errors (they simply don't match). To make algorithms robust without much thought, all the users must do is add a default case that does nothing.

The rough edges we still have to brush up are Rascal features that look into trees without pattern matching. See the PR on field projection; that solution still has edge cases that I'd like to see buttoned up.

sungshik · 2024-11-11T13:56:10Z

src/org/rascalmpl/library/util/ErrorRecovery.rsc

+@synopsis{Check if a parse tree contains any error nodes, the result of error recovery.}
+bool hasErrors(Tree tree) = /appl(error(_, _, _), _) := tree;


The language server for Pico uses this function like this:

// definitions of variables
rel[str, loc] defs = {<"<var.id>", var.src> | /IdType var := input, !hasErrors(var)};

Without context, it could be unclear that it's about parse errors. (This code in the Pico language server occurs in the definition of the analyzer, so it could also be about type errors.)

So, minor suggestion: Because it's user-facing, maybe the function could be renamed to hasParseErrors. (And by the same token, maybe the module could be renamed to ParseErrorRecovery.)
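For readers less familiar with Rascal's deep match, the check under review can be pictured as a recursive search for any error node, here written under the suggested name. The Java sketch below uses a hypothetical `Node` type with an `isError` flag; it is not the Java implementation in this PR and ignores the actual structure of Rascal parse trees.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; not the implementation in this PR. */
final class ErrorTreeCheck {
    /** Hypothetical stand-in for a parse tree node. */
    static final class Node {
        final boolean isError;
        final List<Node> children;

        Node(boolean isError, Node... children) {
            this.isError = isError;
            this.children = Collections.unmodifiableList(Arrays.asList(children));
        }
    }

    /** Returns true if the tree itself or any descendant is an error node. */
    static boolean hasParseErrors(Node tree) {
        if (tree.isError) {
            return true;
        }
        for (Node child : tree.children) {
            if (hasParseErrors(child)) {
                return true;
            }
        }
        return false;
    }
}
```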

I'm looking for a solution where this function is not necessary. Since parse trees are a built-in feature of Rascal, there is opportunity for built-in features for error trees.

There's a branch off of this one, with an experimental semantics for field projection on error trees. Pattern matching already fails naturally on them but field projection fails too often on error trees, making existing code less robust.

jurgenvinju and others added 30 commits March 30, 2022 16:45

recovered Recoverer from git history

9deefa3

Merge branch 'main' into revive-robust-parsing

d1fdf89

fixed imports and whitespace

9ae895c

added documentation

4e43070

renamed recoverer and simplified to whitespace only and all context-f…

ff99ac9

…ree non-terminals

removed unused javadoc plugin that produced warnings

d2a0b66

fixed compiler warnings

d9439a4

fixed warnings

d87b4b8

fixed warnings

4897d88

added boolean parameter \'robust\' to parsing API

5803411

wired boolean parameter for robustness from Rascal function down to p…

3adee6c

…arser implementation

added override

7318e8f

minor additions to make recovery work and removed dead code

f6e13d8

recovered a missing piece from the recovery code

ba0fb43

added possibility for recovered nodes to start back in time (at earli…

6d5d49b

…er input locations)

fixed off-by-one: error nodes should be scheduled one character ahead…

78ee76b

… because the next parser loop iteration always wants to advance one character

fixed another off-by-one

d9f8fc9

added initial construction of skipped nodes

685198d

gave skipped productions a type such that all trees have a type

1d5435b

Merge branch 'main' into revive-robust-parsing

50edd09

updated template

5265ef0

bumped rascal-maven-plugin to 0.8.0 to see if we can benefit from opt…

a2162c9

…imizations

upgraded to rascal-maven-plugin 0.8.1

91fb32b

[maven-release-plugin] prepare release v0.23.1

018826f

[maven-release-plugin] prepare for next development iteration

7b6e8b8

make sure that (a) BasicIDEServices are registered for the commandlin…

4aefd52

…e version of Rascal and (b) the edit command is wired to the edit IDEService

basic IDE services can now also browse contents of files which are no…

dde9bd9

…t in the scheme by copyinhthe contents to a tmp file

Merge branch 'main' into robust-parsing-merge-main

6f9b2b9

Merge branch 'main' into revive-robust-parsing

d88ac77

PieterOlivier and others added 23 commits October 23, 2024 14:15

Simplified MultiErrorPico test

5c7398f

Merge branch 'recovery/only-recovery-stacks-left' into recovery/no-ws…

e764a94

…-merge-experiment

Fixed indentation

fa0ed1d

Added comment on the use of "max match length"

90c3f23

Consolidated multi-error issue encountered in Rascal.

b1622c0

MultiErrorBug condenses the Rascal syntax to only cover the problematic issue. The issue itself is of yet unsolved.

Fixed limit calculation in CaseInsenstiveLiteralMatcher

68c7538

Removed obsolete comment

4d94b28

Merge pull request #2057 from usethesource/recovery/only-recovery-sta…

c75cba6

…cks-left Trigger recovery when only recovery stacks are left

Fixed whitespace issues

803724c

Fixed size inconsistency between allocating and enlarging

eadc959

Merge pull request #2062 from usethesource/recovery/prop-prefix-bug

1633109

Fixed size inconsistency between allocating and enlarging

Fixed regression.

e123bdc

Added hasErrors method

32db7de

Removed whitespace

e1c2f56

Layout fixes

93b4205

Added '\n' end matcher so end-of-line is also seen as skip terminator

ff5f36d

Fixed Pico test

377cb8c

Merge pull request #2064 from usethesource/recovery/end-of-line-token

79267ce

Added '\n' end matcher so end-of-line is also seen as skip terminator

Removed whitespace

54a9c53

Showing mean with two digit precision

8d35fc9

Single whitespace change

37ce549

Merge branch 'main' into feat/error-recovery

415773f

Merge branch 'main' into feat/error-recovery

79be4a0

jurgenvinju reviewed Nov 8, 2024

sungshik reviewed Nov 11, 2024

PieterOlivier and others added 5 commits November 13, 2024 11:31

Renamed 'symbol' field in skipped production to 'def'

e523ce2

Merge pull request #2078 from usethesource/recovery/symbol-to-def

c861e93

Renamed 'symbol' field in skipped production to 'def'

Added timing ratio column and replaced ',' with ':' in end-of-line uri

e8bed22

Improved recovery stats gathering and added R script to process the s…

ea5658f

…tats

Fixed use of '|unknown:///|' as stats location

e848bb5

Review comment authors and dates, in page order:

jurgenvinju · Nov 8, 2024
jurgenvinju · Nov 8, 2024
DavyLandman · Nov 9, 2024
jurgenvinju · Nov 9, 2024
jurgenvinju · Nov 9, 2024 (edited)
jurgenvinju · Nov 9, 2024 (edited)
PieterOlivier · Nov 10, 2024
jurgenvinju · Nov 11, 2024
DavyLandman · Nov 12, 2024
jurgenvinju · Nov 13, 2024
sungshik · Nov 11, 2024 (edited)
jurgenvinju · Nov 11, 2024
jurgenvinju · Nov 11, 2024
