Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(web): improve tokenization output when wordbreaker breaks spec for span properties in output #12229

Merged
merged 2 commits into from
Aug 21, 2024

Conversation

jahorton
Copy link
Contributor

This aims to mitigate the worst side-effects from custom wordbreakers exhibiting the same behaviors that resulted in #12200.

This strongly enforces tokenization in a single direction within the context, never rewinding - even if an encountered span.left value would otherwise indicate to do so. This occurred in the improperly-implemented custom wordbreaker we published with sil.km.gcc - see keymanapp/lexical-models#265, which corrects it.

As blank-string spans '' always appear first within a string at index 0, any time such a blank token appeared, it had the effect of that span being mapped to the full context that preceded the span's actual position. The next span in line would also appear to start from the beginning of context, which would never reasonably match words in the model unless only one word was in the context... generally breaking predictive text. This change prevents both cases by simply replacing that index with the largest-reached index. While not absolutely perfect, it's simple and pretty close to what we want.

This does not mitigate scenarios that resulted in blank spans being emitted where they shouldn't be - such as between contiguous whitespace characters. Mitigating that would be notably more complex.

To validate the changes and help maintain them in the future, I've added a couple of associated unit tests, using both versions of the sil.km.gcc wordbreaker as a text fixture.

@keymanapp-test-bot skip

@jahorton jahorton requested a review from mcdurdin as a code owner August 20, 2024 04:02
@keymanapp-test-bot
Copy link

User Test Results

Test specification and instructions

User tests are not required

@github-actions github-actions bot added web/ and removed web/ labels Aug 20, 2024
// Mitigation aims to prevent the _worst_ side-effects that can result from invalidating the
// underlying assumption of a monotonically-increasing index within the context -
// assigning repeated or blank entries the text that preceded them!
assert.notExists(tokenized.left.find((token) => token.text == text));
Copy link
Contributor Author

@jahorton jahorton Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be extra clear: without the changes in tokenization.ts, this assertion will fail.

The consequences of that are what leads to predictive-text breaking - a full-context string of multiple words can't serve well as the prefix to a single word when looking up lexical entries.

@mcdurdin
Copy link
Member

I think this needs a cherry-pick to 17.0, right?

@jahorton
Copy link
Contributor Author

jahorton commented Aug 20, 2024

I think this needs a cherry-pick to 17.0, right?

The change that caused the behavior this mitigates is 18.0-only - it was part of auto-correct work.

Wouldn't hurt to double-check on a device running 17.0-stable first before completely dismissing the question, though.

@jahorton
Copy link
Contributor Author

Double-checked with the current stable build for iOS - the issue does not arise there due to less stringent requirements on custom wordbreakers. (We aren't making auto-correct available there, which is what increased the strictness.)

@jahorton jahorton merged commit 28ccfe1 into master Aug 21, 2024
17 checks passed
@jahorton jahorton deleted the fix/web/mitigate-malformed-wordbreaking branch August 21, 2024 01:26
@keyman-server
Copy link
Collaborator

Changes in this pull request will be available for download in Keyman version 18.0.94-alpha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants