-
-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(web): improve tokenization output when wordbreaker breaks spec for span properties in output #12229
Conversation
…r span .left, .right values in output
User Test ResultsTest specification and instructions User tests are not required |
// Mitigation aims to prevent the _worst_ side-effects that can result from invalidating the | ||
// underlying assumption of a monotonically-increasing index within the context - | ||
// assigning repeated or blank entries the text that preceded them! | ||
assert.notExists(tokenized.left.find((token) => token.text == text)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be extra clear: without the changes in tokenization.ts
, this assertion will fail.
The consequences of that are what leads to predictive-text breaking - a full-context string of multiple words can't serve well as the prefix to a single word when looking up lexical entries.
I think this needs a cherry-pick to 17.0, right? |
The change that caused the behavior this mitigates is 18.0-only - it was part of auto-correct work. Wouldn't hurt to double-check on a device running 17.0-stable first before completely dismissing the question, though. |
Double-checked with the current stable build for iOS - the issue does not arise there due to less stringent requirements on custom wordbreakers. (We aren't making auto-correct available there, which is what increased the strictness.) |
Changes in this pull request will be available for download in Keyman version 18.0.94-alpha |
This aims to mitigate the worst side-effects from custom wordbreakers exhibiting the same behaviors that resulted in #12200.
This strongly enforces tokenization in a single direction within the context, never rewinding - even if an encountered
span.left
value would otherwise indicate to do so. This occurred in the improperly-implemented custom wordbreaker we published withsil.km.gcc
- see keymanapp/lexical-models#265, which corrects it.As blank-string spans
''
always appear first within astring
at index0
, any time such a blank token appeared, it had the effect of that span being mapped to the full context that preceded the span's actual position. The next span in line would also appear to start from the beginning of context, which would never reasonably match words in the model unless only one word was in the context... generally breaking predictive text. This change prevents both cases by simply replacing that index with the largest-reached index. While not absolutely perfect, it's simple and pretty close to what we want.This does not mitigate scenarios that resulted in blank spans being emitted where they shouldn't be - such as between contiguous whitespace characters. Mitigating that would be notably more complex.
To validate the changes and help maintain them in the future, I've added a couple of associated unit tests, using both versions of the
sil.km.gcc
wordbreaker as a text fixture.@keymanapp-test-bot skip