fix(web): improve tokenization output when wordbreaker breaks spec for span properties in output #12229

jahorton · 2024-08-20T04:02:06Z

This aims to mitigate the worst side-effects from custom wordbreakers exhibiting the same behaviors that resulted in #12200.

This strongly enforces tokenization in a single direction within the context, never rewinding - even if an encountered span.left value would otherwise indicate to do so. Such rewinding values were emitted by the improperly-implemented custom wordbreaker we published with sil.km.gcc - see keymanapp/lexical-models#265, which corrects it.

Because a blank-string span '' is always found at index 0 of a string, any such blank token ended up mapped to the full context that preceded the span's actual position. The next span in line would then also appear to start from the beginning of the context, which could never reasonably match words in the model unless the context held only one word... generally breaking predictive text. This change prevents both cases by simply replacing that index with the largest index reached so far. While not absolutely perfect, it's simple and pretty close to what we want.
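To illustrate the idea, here is a minimal sketch of clamping each span to the largest index reached so far. It is not the code in tokenization.ts: the `SketchSpan` shape, the `gapsBeforeSpans` helper, and the `clamp` flag are all hypothetical names chosen for this example, and the unclamped branch only shows one plausible way a rewound index can pull earlier context into a later token.

```typescript
// Hypothetical span shape for illustration; the real wordbreaker span type may differ.
interface SketchSpan {
  text: string;
  left: number;  // index of the span's first character within the context
  right: number; // index just past the span's last character
}

// Returns the context text found between each span and the position reached before it.
// With clamp = true, the read position is never allowed to move backward.
function gapsBeforeSpans(context: string, spans: SketchSpan[], clamp: boolean): string[] {
  const gaps: string[] = [];
  let reached = 0; // largest index reached so far

  for (const span of spans) {
    // A malformed span (such as '' reported at index 0) would rewind here.
    const left = clamp ? Math.max(span.left, reached) : span.left;
    const right = clamp ? Math.max(span.right, left) : span.right;

    // Note: substring() swaps its arguments when the first exceeds the second,
    // so a rewound `left` silently returns text that was already consumed.
    gaps.push(context.substring(reached, left));
    reached = right;
  }
  return gaps;
}

const demoContext = "one two three";
const demoSpans: SketchSpan[] = [
  { text: "one", left: 0, right: 3 },
  { text: "", left: 0, right: 0 },      // malformed: blank span reported at index 0
  { text: "two", left: 4, right: 7 },
  { text: "three", left: 8, right: 13 },
];

console.log(gapsBeforeSpans(demoContext, demoSpans, false)); // [ '', 'one', 'one ', ' ' ]
console.log(gapsBeforeSpans(demoContext, demoSpans, true));  // [ '', '', ' ', ' ' ]
```

In the unclamped run, the blank span and the span after it both pick up text from the start of the context. With clamping, the blank span contributes nothing and the following span sees only the single space that actually precedes it.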

This does not mitigate scenarios that resulted in blank spans being emitted where they shouldn't be - such as between contiguous whitespace characters. Mitigating that would be notably more complex.

To validate the changes and help maintain them in the future, I've added a couple of associated unit tests, using both versions of the sil.km.gcc wordbreaker as a test fixture.

@keymanapp-test-bot skip

keymanapp-test-bot · 2024-08-20T04:02:10Z

User Test Results

Test specification and instructions

User tests are not required

jahorton · 2024-08-20T04:06:07Z

common/models/templates/test/test-tokenization.js

+      // Mitigation aims to prevent the _worst_ side-effects that can result from invalidating the
+      // underlying assumption of a monotonically-increasing index within the context -
+      // assigning repeated or blank entries the text that preceded them!
+      assert.notExists(tokenized.left.find((token) => token.text == text));


To be extra clear: without the changes in tokenization.ts, this assertion will fail.

The consequences of that are what lead to predictive-text breaking - a full-context string of multiple words can't serve well as the prefix to a single word when looking up lexical entries.
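As a rough illustration of why that matters, consider a toy prefix lookup. The `lookupByPrefix` function and the word list are hypothetical stand-ins, not the lexical model's actual lookup API; the sketch only assumes that candidate entries must begin with the token text.

```typescript
// Hypothetical stand-in for a lexical model's prefix lookup.
function lookupByPrefix(lexicon: readonly string[], prefix: string): string[] {
  // Keep only the entries that begin with the given prefix.
  return lexicon.filter((entry) => entry.slice(0, prefix.length) === prefix);
}

const sketchLexicon = ["the", "then", "there", "quick"];

// A single-word token works as a prefix.
console.log(lookupByPrefix(sketchLexicon, "the"));             // [ 'the', 'then', 'there' ]

// A token that absorbed the preceding context matches nothing.
console.log(lookupByPrefix(sketchLexicon, "quick brown the")); // []
```

No single-word entry can start with a multi-word string, so a token that carries the whole preceding context yields no candidates.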

mcdurdin · 2024-08-20T13:53:02Z

I think this needs a cherry-pick to 17.0, right?

jahorton · 2024-08-20T16:43:54Z

> I think this needs a cherry-pick to 17.0, right?

The change that caused the behavior this mitigates is 18.0-only - it was part of auto-correct work.

Wouldn't hurt to double-check on a device running 17.0-stable first before completely dismissing the question, though.

jahorton · 2024-08-21T01:26:39Z

Double-checked with the current stable build for iOS - the issue does not arise there due to less stringent requirements on custom wordbreakers. (We aren't making auto-correct available there, which is what increased the strictness.)

keyman-server · 2024-08-21T18:04:10Z

Changes in this pull request will be available for download in Keyman version 18.0.94-alpha

fix(web): improve tokenization output when wordbreaker breaks spec fo…

9227569

…r span .left, .right values in output

jahorton requested a review from mcdurdin as a code owner August 20, 2024 04:02

keymanapp-test-bot bot added this to the A18S9 milestone Aug 20, 2024

github-actions bot added common/ common/models/ common/models/templates/ fix web/ labels Aug 20, 2024

feat(web): adds extra assertion to 'mitigation' test

ec32e21

github-actions bot added web/ and removed web/ labels Aug 20, 2024

jahorton commented Aug 20, 2024

mcdurdin approved these changes Aug 20, 2024

jahorton merged commit 28ccfe1 into master Aug 21, 2024
17 checks passed

jahorton deleted the fix/web/mitigate-malformed-wordbreaking branch August 21, 2024 01:26

jahorton mentioned this pull request Aug 21, 2024

fix: fix custom wordbreaker output format for sil.km.gcc keymanapp/lexical-models#265

Merged
