problems with emoji word/grapheme segmentation #8
Comments
Hi Colin,
Thanks for reporting this. I don't have time in the next couple weeks to
look at it, but I'll try to get to it asap.
Best,
-=rsw
Hi, thanks for getting back to me. I think I'm going to try to work with ICU for my current needs, since I'd also really like to use the dictionary-backed word segmentation for Japanese / Chinese. However, as an exercise I implemented a build script in the repo linked earlier, which takes the official WordBreakTest.txt and code-generates Rust test cases from it. It was an excuse for me to try out a build script, but it might be useful for you? You're currently passing everything in 8.0, but the tests in 9.0 are updated with some of the problems I was reporting.
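For anyone curious what that kind of generator involves, here is a minimal sketch, not the actual build script from the repo. It assumes the usual break-test line format (`÷` marks a break, `×` marks no break, code points are hex, and `#` starts a comment). The names `parse_break_test_line`, `generate_test_case`, and `split_words` are hypothetical, made up for illustration.

```rust
/// Parse one line of a break-test file into the expected segments.
/// Returns None for blank or comment-only lines, and for lines that
/// contain a token that is not a valid `char` (e.g. a surrogate).
fn parse_break_test_line(line: &str) -> Option<Vec<String>> {
    // Drop the trailing "# ..." comment, if any.
    let data = line.split('#').next().unwrap_or("").trim();
    if data.is_empty() {
        return None;
    }
    let mut segments = Vec::new();
    let mut current = String::new();
    for token in data.split_whitespace() {
        match token {
            // A break: close the segment collected so far.
            "÷" => {
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
            }
            // No break: keep appending to the current segment.
            "×" => {}
            // Anything else is expected to be a hex code point.
            hex => {
                let cp = u32::from_str_radix(hex, 16).ok()?;
                current.push(char::from_u32(cp)?);
            }
        }
    }
    if !current.is_empty() {
        segments.push(current);
    }
    Some(segments)
}

/// Emit Rust source for one generated test. `split_words` is a
/// hypothetical function standing in for whatever is under test.
fn generate_test_case(index: usize, segments: &[String]) -> String {
    let input: String = segments.concat();
    format!(
        "#[test]\nfn word_break_{}() {{\n    let expected: &[&str] = &{:?};\n    assert_eq!(split_words({:?}), expected);\n}}\n",
        index, segments, input
    )
}
```

A build script would loop over the lines of the test file, call `parse_break_test_line` on each, and write the concatenated `generate_test_case` output to a file under `OUT_DIR` that the test module then pulls in with `include!`.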
Noticed you had a script doing this already. Cheers!
I've started a Unicode 9.0.0 update in #10, though it doesn't yet include the changes needed by these test cases.
Our code seems to think 🚒 is not Glue_After_Zwj. Looking at WordBreakProperty.txt, which we derive our tables from, that is indeed the case: it does not list the fire truck emoji as a GAZ emoji. The spec says that WordBreakProperty.txt is the normative source:
However, it should be a GAZ emoji. GAZ emoji are (non-normatively) defined as:
It's not an emoji modifier base (you can't append Fitzpatrick modifiers or other ZWJ sequences to it), but it is the latter half of the firefighter sequence. This means that it should be GAZ.
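To make the table lookup concrete, here is a rough sketch of checking whether a code point is listed under a given property in a `WordBreakProperty.txt`-style file. This is not the crate's actual table generator. The function names are hypothetical, and the `SAMPLE` lines are illustrative entries written in the file's `code ; Property # comment` format, not a verbatim excerpt.

```rust
/// Illustrative lines in the property-file format (an assumption,
/// not copied from the real file). Note that 1F692 is absent.
const SAMPLE: &str = "\
0041..005A ; ALetter # LATIN CAPITAL LETTER A..Z
2764       ; Glue_After_Zwj # HEAVY BLACK HEART
1F48B      ; Glue_After_Zwj # KISS MARK
";

/// Parse one line into (first, last, property), or None for
/// blank, comment-only, or malformed lines.
fn parse_property_line(line: &str) -> Option<(u32, u32, String)> {
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';').map(str::trim);
    let range = fields.next()?;
    let property = fields.next()?.to_string();
    // Entries are either a single code point or a "first..last" range.
    let (first, last) = match range.split_once("..") {
        Some((a, b)) => (
            u32::from_str_radix(a, 16).ok()?,
            u32::from_str_radix(b, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(range, 16).ok()?;
            (cp, cp)
        }
    };
    Some((first, last, property))
}

/// True if `c` falls in any range listed under `property` in `table`.
fn has_property(table: &str, property: &str, c: char) -> bool {
    let cp = c as u32;
    table
        .lines()
        .filter_map(parse_property_line)
        .any(|(first, last, p)| p == property && first <= cp && cp <= last)
}
```

With this sample, `has_property(SAMPLE, "Glue_After_Zwj", '\u{1F692}')` is false because the fire truck has no entry, which mirrors the situation described above: the lookup can only be as complete as the data file it is derived from.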
Emailed the list. Realized belatedly that I should have used the form since this is a bug, not an ambiguity to be discussed :| I would consider the current behavior up to spec.
hello all,
Unless I'm missing something obvious I think there is some unexpected behaviour around word segmentation of various multi-codepoint emoji. I've put some test cases in a repo. There may be issues around grapheme segmentation in these cases as well?
I can think of a few other emoji combinations that are probably also failing; I'd be happy to write up some more test cases if this is useful?
thanks for all your work, and let me know if I can be of any help. ✌️😎 💭 🐛💥🔨