bug: khmer model custom wordbreaker issues #230

Open
1 of 8 tasks
MakaraSok opened this issue May 11, 2023 · 13 comments

@MakaraSok
Contributor

Describe the bug

The crash happened after this activity was done. See the crash in action:

predictive.text.crashes.mov

Reproduce the bug

No response

Expected behavior

No response

Related issues

No response

Keyman apps

  • Keyman for Android
  • Keyman for iPhone and iPad
  • Keyman for Linux
  • Keyman for macOS
  • Keyman for Windows
  • Keyman Developer
  • KeymanWeb
  • Other - give details at bottom of form

Keyman version

17.0.104-alpha

Operating system

iOS 16.4

Device

iPhone Pro Max Simulator

Target application

No response

Browser

No response

Keyboard name

sil_jarai

Keyboard version

1.0

Language name

Jarai

Additional context

https://keyman.com/keyboards/sil_jarai?bcp47=jra-khmr

@MakaraSok MakaraSok added the bug Something isn't working label May 11, 2023
@mcdurdin mcdurdin added this to the A17S21 milestone Jul 7, 2023
@mcdurdin mcdurdin removed this from the A17S21 milestone Jul 31, 2023
@sgschantz sgschantz added this to the A17S24 milestone Aug 3, 2023
@jahorton jahorton modified the milestones: A17S24, A17S25 Oct 16, 2023
@sgschantz sgschantz modified the milestones: A17S25, A17S26 Oct 30, 2023
@jahorton jahorton assigned jahorton and unassigned jahorton Nov 23, 2023
@darcywong00 darcywong00 modified the milestones: A17S26, A17S27 Nov 27, 2023
@jahorton
Contributor

jahorton commented Dec 4, 2023

This is essentially the same issue seen at keymanapp/keyman#6900, but conflated with issues that arise when handling Khmer script.

Relevant codeblock from the corresponding lexical model:

  wordBreaker: function(str: string) {
    return str.split(/\s/).map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });
  }

For starters, note that this "wordbreaker" was always intended to be something of a stand-in until we develop a better way to handle cases with scripts that don't normally do wordbreaking. (The majority language for the script is Khmer, which doesn't... even if Jarai itself does.)

Furthermore, this wordbreaker is not aware of any implicit meaning behind any punctuation marks in the script - it only breaks on spaces and nothing else. Thus, the guillemets (the double angle-brackets acting as quotation marks) are considered the same as letters and thus part of the same word.

Refer to the video associated with keymanapp/keyman#6900:

lm.replace.quote.and.character.typed.with.the.selected.suggestion.mov

The guillemets are replaced because, as far as the system knows, they are part of the word, not separate. This, in turn, naturally has a strong knock-on effect of making predictions a lot more difficult. No Khmer word actually starts with a left-guillemet («), after all.
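
To make the effect concrete, here is a minimal sketch of the whitespace-only split. The helper name `splitOnWhitespace` and the Latin placeholder text are assumptions for illustration only; they are not part of the model.

```typescript
// Hypothetical helper mirroring the model's `str.split(/\s/)` step.
function splitOnWhitespace(str: string): string[] {
  return str.split(/\s/);
}

// Placeholder text: a guillemet-quoted word sitting between two other words.
const sampleTokens = splitOnWhitespace("abc «def» ghi");

// The guillemets stay glued to the word they enclose, so the middle token
// is "«def»" rather than "def".
console.log(sampleTokens);
```

Since the middle token carries both marks, any suggestion accepted for it replaces the marks along with the letters.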

With my current attempts at reproducing it, the engine actually does recover on the first post-guillemet keystroke most of the time. Selecting such a suggestion also erases the guillemet due to the details noted above re: the model's wordbreaker. It also recovers instantly when starting a new word. Thus, it's not "crashing" - just "failing to find any suggestions."

Finally, note that the predictive-text engine will only allow so many corrections before it stops looking. Having to outright delete the « in order to make good suggestions for the text after it is quite costly, and that cost doesn't reset within the word at present. So, even when "working", corrections will seem markedly more limited in this context.

@jahorton
Contributor

jahorton commented Dec 4, 2023

Looking back through related issue and PR history, this thread seems particularly relevant: https://github.com/keymanapp/keyman/pull/6574/files#r861500917

If we did allow character-class overrides, that'd provide a way to avoid writing a complex custom wordbreaker. But, for now, perhaps I should just tweak this hacky would-be wordbreaker to hack off the « from the actual word.

@jahorton
Contributor

jahorton commented Dec 4, 2023

Here's my first-pass prototype at resolving this.

  wordBreaker: function(str: string) {
    const tokens = str.split(/\s/);
  
    for(let i=0; i < tokens.length; i++) {
      const token = tokens[i];
      if(token.length == 1) {
        continue;
      }
  
      // Opening quotes should be considered a separate token from the word they're next to.
      const punctuation = '«';
      let splitPoint = token.indexOf(punctuation);
      if(splitPoint > -1) {
        const left = token.substring(0, splitPoint);  // (0, -1) => ''
        const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''
  
        if(left) {
          tokens.splice(i++, 0, left);
        }
        tokens.splice(i++, 1, punctuation);
        if(right) {
          tokens.splice(i, 0, right);
        }
        // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
        // there was a `right` portion, as it may have extra marks that also need to be spun off.
        i--; 
      }
    }
  
    return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });
  }

If there are other punctuation marks worth splitting off, I can extend it further, though there will be a bit of extra complexity needed: marks with earlier indices within a token should be processed before later indices for that same token. A bit of an edge case, to be sure, but it could matter at some point.
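
The "earliest index first" rule can be sketched as a small helper. `earliestMarkIndex` is a hypothetical name, and the sketch assumes every mark is a single character.

```typescript
// Hypothetical helper: returns the index of the earliest occurrence of any
// of `marks` within `token`, or -1 if none of them appear.
function earliestMarkIndex(token: string, marks: string[]): number {
  let earliest = -1;
  for (const mark of marks) {
    const index = token.indexOf(mark);
    // Compare against -1 explicitly: an index of 0 is a valid split point.
    if (index >= 0 && (earliest === -1 || index < earliest)) {
      earliest = index;
    }
  }
  return earliest;
}
```

Splitting at the earliest mark means the remaining right-hand portion can be re-scanned on the next pass, so marks are spun off in left-to-right order regardless of the order they appear in the `marks` list.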


This suggestion has been tested locally with punctuation = "'" and the string "The quick brown 'fox' jumped over the lazy dog. 'qu'ot'at'i'o'n'". (The mangled 'qu'ot'at'i'o'n' was there to stress-test things.)

  • 'fox' => ', fox, '

  • 'qu'ot'at'i'o'n' => ', qu, ', ot, ', at, ', i, ', o, ', n, '

  • full output:

    [
      'The', 'quick', 'brown',  "'",
      'fox', "'",     'jumped', 'over',
      'the', 'lazy',  'dog.',   '',
      "'",   'qu',    "'",      'ot',
      "'",   'at',    "'",      'i',
      "'",   'o',     "'",      'n',
      "'"
    ]
    

    Note that "dog." remains a single token because this code isn't checking for "." - just "'".
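
For anyone wanting to repeat a test like this outside a model project, here is a standalone sketch of the same splitting loop. The function name `splitOffMark` and the `punctuation` parameter are assumptions added so the mark can be swapped; the loop body follows the prototype above.

```typescript
// Standalone sketch of the prototype's token-splitting loop.
// `splitOffMark` is a hypothetical name, not part of any Keyman API.
function splitOffMark(str: string, punctuation: string): string[] {
  const tokens = str.split(/\s/);

  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (token.length == 1) {
      continue; // a lone mark (or single letter) needs no splitting
    }

    const splitPoint = token.indexOf(punctuation);
    if (splitPoint > -1) {
      const left = token.substring(0, splitPoint);
      const right = token.substring(splitPoint + 1);

      if (left) {
        tokens.splice(i++, 0, left);      // text before the mark
      }
      tokens.splice(i++, 1, punctuation); // the mark replaces the original token
      if (right) {
        tokens.splice(i, 0, right);       // text after the mark, re-scanned next pass
      }
      i--;
    }
  }
  return tokens;
}

console.log(splitOffMark("brown 'fox' jumped", "'"));
```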

@jahorton jahorton transferred this issue from keymanapp/keyman Dec 8, 2023
@jahorton jahorton changed the title bug(ios): predictive text crashes bug: khmer model custom wordbreaker issues Dec 8, 2023
@mcdurdin mcdurdin modified the milestones: A17S27, A17S28 Dec 8, 2023
@mcdurdin mcdurdin modified the milestones: A17S28, A17S29 Dec 30, 2023
@mcdurdin mcdurdin modified the milestones: A17S29, A17S30 Jan 6, 2024
@mcdurdin mcdurdin modified the milestones: A17S30, A17S31 Jan 20, 2024
@jahorton
Contributor

jahorton commented Jan 26, 2024

Enhancing this to allow splitting off multiple punctuation marks, rather than just one...

  wordBreaker: function(str: string) {
    const tokens = str.split(/\s/);
  
    for(let i=0; i < tokens.length; i++) {
      const token = tokens[i];
      if(token.length == 1) {
        continue;
      }
  
      // Certain punctuation marks should be considered a separate token from the word they're next to.
      const punctuationMarks = ['«', '»' /* add extras here */];
      const punctSplitIndices: number[] = [];
      // Find if and where each mark exists within the token
      for(let j = 0; j < punctuationMarks.length; j++) {
        const split = token.indexOf(punctuationMarks[j]);
        if(split >= 0) {
          punctSplitIndices.push(split);
        }
      }
      // Sort numerically and pick the earliest mark's location.  If none exists, use -1.
      // (Checking the length rather than using `|| -1` keeps a valid split point of 0.)
      punctSplitIndices.sort((a, b) => a - b);
      const splitPoint = punctSplitIndices.length > 0 ? punctSplitIndices[0] : -1;

      if(splitPoint > -1) {
        const left = token.substring(0, splitPoint);  // (0, -1) => ''
        const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''
  
        if(left) {
          tokens.splice(i++, 0, left);
        }
        // The mark found at `splitPoint` becomes its own token.
        tokens.splice(i++, 1, token.charAt(splitPoint));
        if(right) {
          tokens.splice(i, 0, right);
        }
        // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
        // there was a `right` portion, as it may have extra marks that also need to be spun off.
        i--; 
      }
    }
  
    return tokens.map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });
  }

As a reminder, this is a custom wordbreaker used within lexical-model projects. Compare it with the simpler one you may already be using:

  wordBreaker: function(str: string) {
    return str.split(/\s/).map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });
  }

The new one is an enhancement of that, allowing you to also split off whatever specific punctuation marks you add to the array at the spot marked /* add extras here */.
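
One caveat applies once marks become separate tokens: `str.indexOf(token)` returns the position of the first occurrence, so a second « or » in the same context would report the first one's offsets. Below is a rough sketch of one way around that, scanning the string once and recording each token's actual position. The `Span` interface and the name `breakWithOffsets` are assumptions inferred from the fields used in the snippets here, not Keyman's actual types, and the sketch assumes each mark is a single UTF-16 code unit.

```typescript
// Span shape inferred from the fields in the snippets above (an assumption).
interface Span {
  left: number;
  start: number;
  right: number;
  end: number;
  text: string;
}

// Hypothetical variant: breaks on whitespace, spins off each mark in `marks`
// as its own token, and records where each token actually sits in `str`.
function breakWithOffsets(str: string, marks: string[]): Span[] {
  const spans: Span[] = [];
  let tokenStart = -1; // -1 means "not currently inside a token"

  const push = (from: number, to: number): void => {
    spans.push({ left: from, start: from, right: to, end: to, text: str.substring(from, to) });
  };

  for (let i = 0; i < str.length; i++) {
    const ch = str.charAt(i);
    const isSpace = /\s/.test(ch);
    const isMark = marks.indexOf(ch) >= 0;

    if (isSpace || isMark) {
      if (tokenStart >= 0) {
        push(tokenStart, i); // close the token in progress
        tokenStart = -1;
      }
      if (isMark) {
        push(i, i + 1);      // the mark is a token of its own
      }
    } else if (tokenStart < 0) {
      tokenStart = i;        // first character of a new token
    }
  }
  if (tokenStart >= 0) {
    push(tokenStart, str.length);
  }
  return spans;
}
```

Unlike `str.split(/\s/)`, this sketch does not emit empty tokens for leading, trailing, or repeated whitespace. Whether the engine relies on those empty tokens would need checking before using anything like it in a model.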

@darcywong00 darcywong00 modified the milestones: B17S6, A18S1 Apr 28, 2024
@darcywong00 darcywong00 modified the milestones: A18S1, A18S2 May 11, 2024
@mcdurdin mcdurdin modified the milestones: A18S2, A18S3 May 24, 2024
@mcdurdin mcdurdin modified the milestones: A18S3, A18S4 Jun 7, 2024
@darcywong00 darcywong00 modified the milestones: A18S4, A18S5 Jun 21, 2024
@darcywong00 darcywong00 modified the milestones: A18S5, A18S6 Jul 5, 2024
@darcywong00 darcywong00 modified the milestones: A18S6, A18S7 Jul 19, 2024
@darcywong00 darcywong00 modified the milestones: A18S7, A18S8 Aug 2, 2024
@darcywong00 darcywong00 modified the milestones: A18S8, A18S9 Aug 17, 2024
@darcywong00 darcywong00 modified the milestones: A18S9, A18S10 Aug 31, 2024
@darcywong00 darcywong00 modified the milestones: A18S10, A18S11 Sep 14, 2024
@darcywong00 darcywong00 modified the milestones: A18S11, A18S12 Sep 28, 2024
@darcywong00 darcywong00 modified the milestones: A18S12, A18S13 Oct 11, 2024
@darcywong00 darcywong00 modified the milestones: A18S13, A18S14 Oct 26, 2024
@darcywong00 darcywong00 modified the milestones: A18S14, A18S15 Nov 9, 2024