problems with emoji word/grapheme segmentation #8

Closed
cmyr opened this issue Nov 11, 2016 · 8 comments

Comments

@cmyr

cmyr commented Nov 11, 2016

hello all,

Unless I'm missing something obvious, I think there is some unexpected behaviour around word segmentation of various multi-codepoint emoji. I've put some test cases in a repo. There may be issues with grapheme segmentation in these cases as well.

I can think of a few other emoji combinations that are probably also failing; I'd be happy to write up some more test cases if that would be useful.

thanks for all your work, and let me know if I can be of any help. ✌️😎 💭 🐛💥🔨
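
For readers unfamiliar with the term, "multi-codepoint emoji" means a single displayed emoji that is encoded as several Unicode scalar values. The sketch below uses only the Rust standard library and a few hand-picked sequences (my own choice, not the test cases from the repo mentioned above) to show the underlying code points that a segmenter has to keep together.

```rust
/// Format each Unicode scalar value of `s` as "U+XXXX".
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // Each string below is normally displayed as a single emoji.
    // The samples are illustrative and chosen by hand.
    let samples = [
        ("victory hand + variation selector", "\u{270C}\u{FE0F}"),
        ("man firefighter (ZWJ sequence)", "\u{1F468}\u{200D}\u{1F692}"),
        ("flag (two regional indicators)", "\u{1F1FA}\u{1F1F8}"),
    ];
    for (name, s) in samples.iter() {
        println!("{}: {}", name, code_points(s).join(" "));
    }
}
```

Iterating with `chars()` splits at every scalar value, which is exactly what a word or grapheme segmenter must not do for these sequences.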

@kwantam
Member

kwantam commented Nov 11, 2016 via email

@cmyr
Author

cmyr commented Nov 14, 2016

hi, thanks for getting back to me.

I think I'm going to try to work with ICU for my current needs, since I'd also really like to use the dictionary-backed word segmentation for Japanese / Chinese. However, as an exercise I implemented a build script in the repo linked earlier, which takes the official WordBreakTest.txt and generates Rust test cases from it. It was an excuse for me to try out a build script, but it might be useful for you? You're currently passing everything in 8.0, but the tests in 9.0 have been updated and cover some of the problems I was reporting.
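
A rough sketch of what such a generator might look like, assuming the usual layout of the Unicode word-break test file (published as WordBreakTest.txt): each data line is a run of hexadecimal code points separated by `÷` (break) or `×` (no break), followed by a `#` comment. This is not the build script from the repo; `expected_segments`, `render_test`, and the `segment_words` function named in the generated source are hypothetical names used only for illustration.

```rust
/// Parse one data line of the test file into the expected word segments.
/// Returns None for blank or comment-only lines and for malformed input.
fn expected_segments(line: &str) -> Option<Vec<String>> {
    // Drop the trailing "# ..." comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut segments = Vec::new();
    let mut current = String::new();
    for token in data.split_whitespace() {
        match token {
            // A break: close the segment collected so far, if any.
            "÷" => {
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
            }
            // No break: keep appending to the current segment.
            "×" => {}
            // Anything else should be a hexadecimal code point.
            hex => {
                let cp = u32::from_str_radix(hex, 16).ok()?;
                current.push(char::from_u32(cp)?);
            }
        }
    }
    // Tolerate a line that lacks the final break marker.
    if !current.is_empty() {
        segments.push(current);
    }
    Some(segments)
}

/// Render one generated test as Rust source text.
/// `segment_words` is a hypothetical function under test, not a real crate API.
fn render_test(index: usize, segments: &[String]) -> String {
    let input: String = segments.concat();
    let expected: Vec<String> = segments
        .iter()
        .map(|s| format!("\"{}\"", s.escape_unicode()))
        .collect();
    format!(
        "#[test]\nfn word_break_{}() {{\n    assert_eq!(segment_words(\"{}\"), vec![{}]);\n}}\n",
        index,
        input.escape_unicode(),
        expected.join(", ")
    )
}

fn main() {
    // A line written in the file's format; the comment text is made up.
    let line = "÷ 0061 × 0308 ÷ 0020 ÷\t# illustrative comment";
    if let Some(segments) = expected_segments(line) {
        print!("{}", render_test(0, &segments));
    }
}
```

In a real build script the rendered tests would be written to a file under `OUT_DIR` and pulled in with `include!`; that part is left out here.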

@cmyr
Author

cmyr commented Nov 16, 2016

noticed you had a script doing this already. Cheers!

@mbrubeck
Contributor

I've started a Unicode 9.0.0 update in #10, though it doesn't yet include the changes needed by these test cases.

@kwantam
Member

kwantam commented Dec 22, 2016

Fixed by #10

@cmyr if I'm wrong please let me know. Thanks for reporting this!

@kwantam kwantam closed this as completed Dec 22, 2016
@cmyr
Author

cmyr commented Dec 22, 2016

@kwantam @mbrubeck the cases mentioned here are fixed, but there are still some issues with some of the new Unicode 9.0 ZWJ sequences. I've added some extra test cases in a branch here; would you like me to PR this?

@Manishearth
Member

Manishearth commented Dec 22, 2016

Our code seems to think 🚒 is not Glue_After_Zwj.

Looking at WordBreakProperty.txt, that is indeed the case. WordBreakProperty.txt, which we derive our tables from, does not list the fire truck emoji as a GAZ emoji.
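
The lookup described above can be sketched as follows. This is a minimal illustration and not the crate's table generator: it assumes the standard UCD property-file layout (`start[..end] ; Value # comment`), and the embedded excerpt is a few hand-written lines modeled on what I understand the Unicode 9.0 file to contain, so check the real file before relying on it. `parse_property_line`, `load_table`, and `word_break_value` are hypothetical helper names.

```rust
/// Illustrative lines in the file's format; NOT the full table.
const EXCERPT: &str = "\
# Hand-written excerpt for illustration only
2764          ; Glue_After_Zwj # HEAVY BLACK HEART
1F48B         ; Glue_After_Zwj # KISS MARK
1F466..1F469  ; E_Base_GAZ     # BOY..WOMAN
";

/// Parse one property line into (start, end, value).
/// Returns None for blank, comment-only, or malformed lines.
fn parse_property_line(line: &str) -> Option<(u32, u32, String)> {
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';');
    let range = fields.next()?.trim();
    let value = fields.next()?.trim().to_string();
    let mut ends = range.split("..");
    let start = u32::from_str_radix(ends.next()?, 16).ok()?;
    // A single code point is treated as a one-element range.
    let end = match ends.next() {
        Some(e) => u32::from_str_radix(e, 16).ok()?,
        None => start,
    };
    Some((start, end, value))
}

/// Build a simple range table from the text of a property file.
fn load_table(text: &str) -> Vec<(u32, u32, String)> {
    text.lines().filter_map(parse_property_line).collect()
}

/// Return the listed property value for `c`, or None if it is not listed.
fn word_break_value(table: &[(u32, u32, String)], c: char) -> Option<&str> {
    let cp = c as u32;
    table
        .iter()
        .find(|(start, end, _)| *start <= cp && cp <= *end)
        .map(|(_, _, value)| value.as_str())
}

fn main() {
    let table = load_table(EXCERPT);
    // U+1F692 FIRE ENGINE is absent from the excerpt, mirroring the report above.
    println!("{:?}", word_break_value(&table, '\u{1F692}'));
    println!("{:?}", word_break_value(&table, '\u{2764}'));
}
```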

Spec says that WordBreakProperty.txt is the normative source:

The Word_Break property value assignments are explicitly listed in the corresponding data file in [Props]. The values in that file are the normative property values.

However, it should be a GAZ emoji. GAZ emoji are (non-normatively) defined as:

Emoji characters that do not break from a previous ZWJ in a defined emoji zwj sequence, and are not listed as Emoji_Modifier_Base=Yes in emoji-data.txt. See [UTR51].

It's not an emoji modifier base (you can't append Fitzpatrick modifiers or other ZWJ sequences to it), but it is the latter half of the firefighter sequence. This means that it should be GAZ.
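
The reasoning above can be written down as a small derivation: collect every code point that directly follows a ZWJ in a defined sequence, then drop the emoji modifier bases. The sketch below is only an illustration of that non-normative description, not how the normative data file is produced. The two sequences and the modifier-base range are a hand-picked subset that I am assuming matches the emoji data of the time, and `derive_gaz` is a hypothetical name.

```rust
use std::collections::BTreeSet;

const ZWJ: u32 = 0x200D;

/// Code points that directly follow a ZWJ in any given sequence and are
/// not emoji modifier bases (the non-normative GAZ description).
fn derive_gaz(sequences: &[&[u32]], modifier_bases: &BTreeSet<u32>) -> BTreeSet<u32> {
    let mut gaz = BTreeSet::new();
    for seq in sequences {
        // Look at each adjacent pair (previous, next) in the sequence.
        for pair in seq.windows(2) {
            if pair[0] == ZWJ && !modifier_bases.contains(&pair[1]) {
                gaz.insert(pair[1]);
            }
        }
    }
    gaz
}

fn main() {
    // Hand-picked subset, assumed for illustration:
    // kiss: WOMAN, ZWJ, HEAVY BLACK HEART, VS16, ZWJ, KISS MARK, ZWJ, MAN
    let kiss: [u32; 8] = [0x1F469, ZWJ, 0x2764, 0xFE0F, ZWJ, 0x1F48B, ZWJ, 0x1F468];
    // firefighter: MAN, ZWJ, FIRE ENGINE
    let firefighter: [u32; 3] = [0x1F468, ZWJ, 0x1F692];
    // BOY..WOMAN, assumed here to be the relevant modifier bases.
    let bases: BTreeSet<u32> = (0x1F466..=0x1F469).collect();

    let gaz = derive_gaz(&[&kiss[..], &firefighter[..]], &bases);
    for cp in &gaz {
        println!("U+{:04X}", cp);
    }
}
```

Under these assumptions the fire engine (U+1F692) lands in the derived set, while MAN (U+1F468) is excluded because it is a modifier base.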

@Manishearth
Member

Emailed the list. Realized belatedly that I should have used the form since this is a bug, not an ambiguity to be discussed :|

I would consider the current behavior conformant with the spec as written.
