bug: khmer model custom wordbreaker issues #230
Comments
This is essentially the same issue seen at keymanapp/keyman#6900, but conflated with issues that arise when handling Khmer script. Relevant codeblock from the corresponding lexical model:
lexical-models/release/sil/sil.jra-khmr.jarai/source/sil.jra-khmr.jarai.model.ts
Lines 12 to 21 in 0491eb0
wordBreaker: function(str: string) {
  return str.split(/\s/).map(function(token) {
    return {
      left: str.indexOf(token),
      start: str.indexOf(token),
      right: str.indexOf(token) + token.length,
      end: str.indexOf(token) + token.length,
      text: token
    }
  });
For starters, note that this "wordbreaker" was always intended to be something of a stand-in until we develop a better way to handle scripts that don't normally do wordbreaking. (The majority language for the script is Khmer, which doesn't... even if Jarai itself does.)
Furthermore, this wordbreaker is not aware of any implicit meaning behind any punctuation marks in the script - it breaks on whitespace and nothing else. Thus, the guillemets (the double angle-brackets acting as quotation marks) are treated the same as letters and so become part of the adjacent word.
Refer to the video associated with keymanapp/keyman#6900:
lm.replace.quote.and.character.typed.with.the.selected.suggestion.mov
The guillemets are replaced because, as far as the system knows, they are part of the word, not separate. This, in turn, has a strong knock-on effect of making predictions a lot more difficult. No Khmer word actually starts with a left-guillemet (
With my current attempts at reproducing it, the engine actually does recover on the first post-guillemet keystroke most of the time. Selecting such a suggestion also erases the guillemet due to the details noted above re: the model's wordbreaker. It also recovers instantly when starting a new word. Thus, it's not "crashing" - just "failing to find any suggestions."
Finally, note that the predictive-text engine will only allow so many corrections before it stops looking. Having to outright delete the
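To make the point about guillemets concrete, here is a minimal sketch that runs the same whitespace-only logic as a stand-alone function. The names `Span` and `whitespaceOnlyBreaker` are hypothetical and exist only for this illustration, and the sample text uses Latin placeholder letters rather than real Jarai words.

```typescript
// Hypothetical stand-alone copy of the whitespace-only breaker, for illustration only.
interface Span {
  left: number;
  start: number;
  right: number;
  end: number;
  text: string;
}

function whitespaceOnlyBreaker(str: string): Span[] {
  return str.split(/\s/).map(function (token) {
    const start = str.indexOf(token);
    const end = start + token.length;
    return { left: start, start: start, right: end, end: end, text: token };
  });
}

// Placeholder text: the opening guillemet is typed directly before the word.
const demoSpans = whitespaceOnlyBreaker('abc «def');
const demoTexts = demoSpans.map(function (span) { return span.text; });

// The guillemet stays glued to the word, so the second token is '«def'.
console.log(demoTexts);
```

Because the second token includes the guillemet, anything that replaces "the current word" will replace the guillemet along with it.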
Looking back through related issue and PR history, this thread seems particularly relevant: https://github.com/keymanapp/keyman/pull/6574/files#r861500917
If we did allow character-class overrides, that'd provide a way to avoid writing a complex custom wordbreaker. But, for now, perhaps I should just tweak this hacky would-be wordbreaker to hack off the
Here's my first-pass prototype at resolving this.
wordBreaker: function(str: string) {
  const tokens = str.split(/\s/);

  for(let i=0; i < tokens.length; i++) {
    const token = tokens[i];

    // A single-character token cannot be split any further.
    if(token.length == 1) {
      continue;
    }

    // Opening quotes should be considered a separate token from the word they're next to.
    const punctuation = '«';
    const splitPoint = token.indexOf(punctuation);
    if(splitPoint > -1) {
      const left = token.substring(0, splitPoint);  // (0, 0) => ''
      const right = token.substring(splitPoint+1);  // Starting past the end of the string => ''

      // Insert the text before the mark (if any) ahead of the current token.
      if(left) {
        tokens.splice(i++, 0, left);
      }
      // Replace the current token with the mark itself.
      tokens.splice(i++, 1, punctuation);
      // Insert the text after the mark (if any) right behind it.
      if(right) {
        tokens.splice(i, 0, right);
      }
      // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
      // there was a `right` portion, as it may have extra marks that also need to be spun off.
      i--;
    }
  }

  return tokens.map(function(token) {
    return {
      left: str.indexOf(token),
      start: str.indexOf(token),
      right: str.indexOf(token) + token.length,
      end: str.indexOf(token) + token.length,
      text: token
    }
  });
}
If there are other punctuation marks worth splitting off, I can extend it further, though a bit of extra complexity will be needed: within a single token, marks at earlier indices should be processed before marks at later indices. A bit of an edge case, to be sure, but it could matter at some point. This suggestion has been tested locally with punctuation =
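One way to sidestep the ordering concern is to scan each token once from left to right, so that earlier marks are always emitted before later ones. The sketch below is only an illustration of that idea: `splitOffMarks` is a hypothetical helper that does not exist in the model file, and it assumes every mark is a single UTF-16 code unit (true for `«` and `»`).

```typescript
// Hypothetical helper: splits one whitespace-delimited token into words and
// stand-alone punctuation marks, preserving left-to-right order.
// Assumes each entry in `marks` is a single UTF-16 code unit.
function splitOffMarks(token: string, marks: string[]): string[] {
  const pieces: string[] = [];
  let word = '';

  for (let k = 0; k < token.length; k++) {
    const ch = token.charAt(k);
    if (marks.indexOf(ch) >= 0) {
      // Flush any letters collected so far, then emit the mark on its own.
      if (word) {
        pieces.push(word);
        word = '';
      }
      pieces.push(ch);
    } else {
      word += ch;
    }
  }

  // Flush whatever follows the last mark.
  if (word) {
    pieces.push(word);
  }
  return pieces;
}

// Placeholder text, not real Jarai words.
const quotedPieces = splitOffMarks('«abc»', ['«', '»']);
const repeatedPieces = splitOffMarks('a«b«c', ['«']);
console.log(quotedPieces, repeatedPieces);
```

A single pass like this also handles a mark that appears more than once in the same token, without having to revisit the token.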
Enhancing this to allow splitting off multiple punctuation marks, rather than just one...
wordBreaker: function(str: string) {
  const tokens = str.split(/\s/);

  for(let i=0; i < tokens.length; i++) {
    const token = tokens[i];

    // A single-character token cannot be split any further.
    if(token.length == 1) {
      continue;
    }

    // Certain punctuation marks should be considered a separate token from the word they're next to.
    // Each mark is assumed to be a single UTF-16 code unit.
    const punctuationMarks = ['«', '»' /* add extras here */];
    const punctSplitIndices: number[] = [];

    // Find if and where each mark exists within the token
    for(let j = 0; j < punctuationMarks.length; j++) {
      const split = token.indexOf(punctuationMarks[j]);
      if(split >= 0) {
        punctSplitIndices.push(split);
      }
    }

    // Sort numerically (the default sort compares elements as strings) and pick the earliest
    // mark's location. If none exists, use -1. The length check is explicit because a split
    // point of 0 is falsy, so `|| -1` would wrongly discard it.
    punctSplitIndices.sort((a, b) => a - b);
    const splitPoint = punctSplitIndices.length > 0 ? punctSplitIndices[0] : -1;

    if(splitPoint > -1) {
      const punctuation = token.charAt(splitPoint);  // The mark found at the split point
      const left = token.substring(0, splitPoint);   // (0, 0) => ''
      const right = token.substring(splitPoint+1);   // Starting past the end of the string => ''

      if(left) {
        tokens.splice(i++, 0, left);
      }
      tokens.splice(i++, 1, punctuation);
      if(right) {
        tokens.splice(i, 0, right);
      }
      // Ensure that the next iteration puts `i` immediately after the punctuation token... even if
      // there was a `right` portion, as it may have extra marks that also need to be spun off.
      i--;
    }
  }

  return tokens.map(function(token) {
    return {
      left: str.indexOf(token),
      start: str.indexOf(token),
      right: str.indexOf(token) + token.length,
      end: str.indexOf(token) + token.length,
      text: token
    }
  });
}
As a reminder, this is a custom wordbreaker used within lexical-model projects. Anywhere you've used this one:
lexical-models/release/sil/sil.jra-khmr.jarai/source/sil.jra-khmr.jarai.model.ts
Lines 12 to 21 in 0491eb0
This new one is an enhancement of that, allowing you to also split off whatever specific punctuation marks you define within the array saying to
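One detail the wordbreakers in this thread share: `str.indexOf(token)` always returns the first occurrence, so a token that appears more than once in the context gets the same offsets every time. The following is a minimal sketch of one possible way around that, searching from the end of the previous token instead. `Span` and `spansFor` are hypothetical names used only for illustration, and the sketch assumes the tokens are listed in the order they appear in `str`, which holds for tokens produced by splitting.

```typescript
interface Span {
  left: number;
  start: number;
  right: number;
  end: number;
  text: string;
}

// Hypothetical helper: maps already-split tokens back onto `str`, resuming each
// search where the previous token ended rather than at index 0.
// Assumes `tokens` are in the same order as they occur in `str`.
function spansFor(str: string, tokens: string[]): Span[] {
  let cursor = 0;
  return tokens.map(function (token) {
    const start = str.indexOf(token, cursor);
    const end = start + token.length;
    cursor = end;
    return { left: start, start: start, right: end, end: end, text: token };
  });
}

// Placeholder text in which the same word occurs three times.
const context = 'ab «ab» ab';
const contextTokens = ['ab', '«', 'ab', '»', 'ab'];
const starts = spansFor(context, contextTokens).map(function (span) { return span.start; });

// A plain indexOf would report 0 for every 'ab'; the cursor keeps them distinct.
console.log(starts);
```

Whether the engine is sensitive to duplicated offsets in practice has not been checked here; the sketch only shows how distinct offsets could be produced if they turn out to matter.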
Describe the bug
The crash happened after this activity was done. See the crash in action:
predictive.text.crashes.mov
Reproduce the bug
No response
Expected behavior
No response
Related issues
No response
Keyman apps
Keyman version
17.0.104-alpha
Operating system
iOS 16.4
Device
iPhone Pro Max Simulator
Target application
No response
Browser
No response
Keyboard name
sil_jarai
Keyboard version
1.0
Language name
Jarai
Additional context
https://keyman.com/keyboards/sil_jarai?bcp47=jra-khmr