problems with emoji word/grapheme segmentation #8

cmyr · 2016-11-11T00:41:54Z

hello all,

Unless I'm missing something obvious, I think there is some unexpected behaviour around word segmentation of various multi-codepoint emoji. I've put some test cases in a repo. There may be issues around grapheme segmentation in these cases as well.

I can think of a few other emoji combinations that are probably also failing; I'd be happy to write up some more test cases if this is useful?

thanks for all your work, and let me know if I can be of any help. ✌️😎 💭 🐛💥🔨

kwantam · 2016-11-11T01:04:54Z

Hi Colin, Thanks for reporting this. I don't have time in the next couple weeks to look at it, but I'll try to get to it asap. Best, -=rsw

cmyr · 2016-11-14T14:22:19Z

hi, thanks for getting back to me.

I think I'm going to try and work with ICU for my current needs, since I'd really also like to use the dictionary-backed word segmentation for Japanese / Chinese. However, as an exercise I implemented a build script in the repo linked earlier, which takes the official WordBreakTest.txt and generates Rust test cases from it. It was an excuse for me to try out a build script, but it might be useful for you? You're currently passing everything in 8.0, but the tests in 9.0 are updated with some of the problems I was reporting.
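For anyone curious what such a generator has to do, here is a minimal sketch of the parsing step, assuming the usual break-test line format where `÷` marks a boundary, `×` marks no boundary, and code points are written in hex. The function name `parse_break_test_line` and the sample line are made up for illustration; this is not the build script from the repo, nor the script this crate already has.

```rust
/// Parse one line of a break-test file into the segments it expects.
/// Returns None for blank or comment-only lines, and for lines with a
/// token that is not a valid Unicode scalar value (e.g. a lone surrogate).
fn parse_break_test_line(line: &str) -> Option<Vec<String>> {
    // Everything after '#' is a human-readable comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }

    let mut segments = Vec::new();
    let mut current = String::new();
    for token in data.split_whitespace() {
        match token {
            // Boundary: close the segment collected so far, if any.
            "÷" => {
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
            }
            // No boundary: keep appending to the current segment.
            "×" => {}
            // Anything else should be a hex code point.
            hex => {
                let cp = u32::from_str_radix(hex, 16).ok()?;
                current.push(char::from_u32(cp)?);
            }
        }
    }
    if !current.is_empty() {
        segments.push(current);
    }
    Some(segments)
}

fn main() {
    // Illustrative line in the test-file format, not copied from the data file.
    let line = "÷ 0061 × 0308 ÷ 0020 ÷\t# illustrative comment";
    let segments = parse_break_test_line(line).expect("data line");
    println!("{:?}", segments);
}
```

A generator would then emit one assertion per parsed line, comparing these expected segments against whatever the segmenter under test returns for the concatenated string.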

cmyr · 2016-11-16T16:46:47Z

noticed you had a script doing this already. Cheers!

mbrubeck · 2016-12-13T00:20:48Z

I've started a Unicode 9.0.0 update in #10, though it doesn't yet include the changes needed by these test cases.

kwantam · 2016-12-22T05:22:30Z

Fixed by #10

@cmyr if I'm wrong please let me know. Thanks for reporting this!

cmyr · 2016-12-22T16:26:13Z

@kwantam @mbrubeck the cases mentioned here are fixed, but there are still some issues with some of the new unicode 9.0 ZWJ sequences. I've added some extra test cases in a branch here, would you like me to PR this?

Manishearth · 2016-12-22T18:05:18Z

Our code seems to think 🚒 is not Glue_After_Zwj.

Looking at WordBreakProperty.txt, that is indeed the case. WordBreakProperty.txt, which we derive our tables from, does not list the fire truck emoji as a GAZ emoji.

Spec says that WordBreakProperty.txt is the normative source:

The Word_Break property value assignments are explicitly listed in the corresponding data file in [Props]. The values in that file are the normative property values.

However, it should be a GAZ emoji. GAZ emoji are (non-normatively) defined as:

Emoji characters that do not break from a previous ZWJ in a defined emoji zwj sequence, and are not listed as Emoji_Modifier_Base=Yes in emoji-data.txt. See [UTR51].

It's not an emoji modifier base (you can't append Fitzpatrick modifiers or other ZWJ sequences to it), but it is the latter half of the firefighter sequence. This means that it should be GAZ.
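To make the consequence concrete, here is a toy sketch of the one rule involved, WB3c in Unicode 9.0 (no break between ZWJ and a Glue_After_Zwj or E_Base_GAZ character). It assumes the man firefighter sequence is U+1F468 U+200D U+1F692. The function `break_allowed_after` and both closures are hypothetical stand-ins for the real property lookup; every other word-break rule is ignored, so this is not how the crate implements segmentation.

```rust
const ZWJ: char = '\u{200D}';
const FIRE_ENGINE: char = '\u{1F692}';

/// Toy model of rule WB3c only: a break between `prev` and `next` is
/// suppressed when `prev` is ZWJ and `next` has the GAZ-like property.
/// `is_gaz` is a hypothetical stand-in for the generated property table.
fn break_allowed_after(prev: char, next: char, is_gaz: impl Fn(char) -> bool) -> bool {
    !(prev == ZWJ && is_gaz(next))
}

fn main() {
    // Assumed code points for the man firefighter ZWJ sequence.
    let firefighter = "\u{1F468}\u{200D}\u{1F692}";
    let chars: Vec<char> = firefighter.chars().collect();

    // Stand-in for the data file as described above: fire engine not listed.
    let as_listed = |_c: char| false;
    // Stand-in for the expected assignment: fire engine is GAZ.
    let as_expected = |c: char| c == FIRE_ENGINE;

    println!(
        "break after ZWJ, fire engine unlisted: {}",
        break_allowed_after(chars[1], chars[2], as_listed)
    );
    println!(
        "break after ZWJ, fire engine listed:   {}",
        break_allowed_after(chars[1], chars[2], as_expected)
    );
}
```

With the fire engine missing from the table, nothing stops the default break after the ZWJ, so the sequence is split in two; with it listed, the sequence stays together as one word.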

Manishearth · 2016-12-22T18:38:29Z

Emailed the list. Realized belatedly that I should have used the form since this is a bug, not an ambiguity to be discussed :|

I would consider the current behavior up to spec.

cmyr mentioned this issue Dec 21, 2016

Update to Unicode 9.0.0 #10

Merged

kwantam closed this as completed Dec 22, 2016
