Update to Unicode 9.0.0 #10
Force-pushed from 5d0a61d to 624aba4.
I tried an optimization for the regional indicators in Manishearth@2297574. Does that make sense, or did you mean something else? The state doesn't persist when you jump between forward and reverse iteration, because that sounds like a rare use case and I don't want to add extra overhead to the forward iteration case.
Added support for the RI stuff too. The tests don't seem to handle flags at all? This was completely missed by the tests, which isn't good. Can we contribute some tests? A simple test is that 🇦🇫🇦🇽🇦🇱🇩🇿🇦🇸🇦🇩🇦🇴 should decompose into individual flags when word-sliced.
Yeah, none of their tests contain more than two regional indicators :|
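Since the official tests never exercise runs longer than two regional indicators, a self-contained sketch of the pairing rule (GB12/GB13: within a run of Regional Indicator symbols, a boundary falls only after each complete pair) may help when writing new test vectors. The helper names here are illustrative, not the crate's actual API.

```rust
/// Illustrative helper (not the crate's API): true for U+1F1E6..=U+1F1FF,
/// the Regional Indicator (RI) symbols that pair up into flag emoji.
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Split a string into segments, breaking a run of RI symbols only after
/// each complete pair, per GB12/GB13. Non-RI characters become their own
/// segments here for simplicity.
fn split_flags(s: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut buf = String::new();
    let mut ri_count = 0;
    for c in s.chars() {
        if is_regional_indicator(c) {
            buf.push(c);
            ri_count += 1;
            if ri_count == 2 {
                out.push(std::mem::take(&mut buf));
                ri_count = 0;
            }
        } else {
            if !buf.is_empty() {
                out.push(std::mem::take(&mut buf));
            }
            ri_count = 0;
            out.push(c.to_string());
        }
    }
    if !buf.is_empty() {
        out.push(buf); // trailing unpaired RI stays as its own segment
    }
    out
}
```

A run of six RI symbols splits into three flags, and an odd trailing RI stays unpaired, which is exactly the behavior the two-RI official tests can never distinguish from a naive implementation.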
Force-pushed from 49f312b to 4849b31.
All tests pass! On a serious note, the official tests are lacking, so this should be reviewed closely. I added some tests for the missing cases which I found, but there are more things I may have missed.
// rule GB9b: include any preceding Prepend characters
for (i, c) in self.string[..idx].char_indices().rev() {
    // TODO: Cache this to avoid repeated lookups in the common case.
    match gr::grapheme_category(c) {
This `grapheme_category` call still seems like it will be doing duplicate work. The caching I was thinking of adding was saving the result of this call from the last time through this loop, so that `grapheme_category` doesn't run multiple times on the same char.
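A minimal sketch of the caching being suggested, with a stand-in category type and lookup function (the real ones live in the generated property tables): the cache records the last (byte index, category) pair so the table lookup is skipped for a position the reverse pass has already classified. All names here are illustrative, not the crate's actual code.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Cat {
    Prepend,
    Other,
}

// Count table lookups so the caching is observable from outside.
thread_local! {
    static LOOKUPS: Cell<usize> = Cell::new(0);
}

/// Stand-in for gr::grapheme_category: U+0600 (ARABIC NUMBER SIGN)
/// really is Prepend in UAX #29; everything else is lumped into Other.
fn grapheme_category(c: char) -> Cat {
    LOOKUPS.with(|n| n.set(n.get() + 1));
    if c == '\u{0600}' { Cat::Prepend } else { Cat::Other }
}

struct RevScanner {
    /// Last lookup as (byte index, category); a repeated query for the
    /// same position returns the cached answer without a table lookup.
    cat_cache: Option<(usize, Cat)>,
}

impl RevScanner {
    fn category_at(&mut self, i: usize, c: char) -> Cat {
        if let Some((ci, cat)) = self.cat_cache {
            if ci == i {
                return cat;
            }
        }
        let cat = grapheme_category(c);
        self.cat_cache = Some((i, cat));
        cat
    }
}
```

Querying the same position twice performs only one table lookup, which is the saving the review comment is after.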
for (i, c) in self.string[..idx].char_indices().rev() {
    // TODO: Cache this to avoid repeated lookups in the common case.
    match gr::grapheme_category(c) {
        gr::GC_Prepend => idx = i,
Hmm, this needs to either invalidate or update `self.catb`.
This also means we are missing a test case for this code, specifically for the case where multiple adjacent Prepend characters come between two "normal" characters.
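A hypothetical test vector for that missing case (U+0600 ARABIC NUMBER SIGN carries Grapheme_Cluster_Break=Prepend in Unicode 9): two adjacent Prepend characters between two ordinary characters. The expected clusters below are my reading of GB9b (each Prepend attaches to what follows it), not taken from the official test data.

```rust
/// (input, expected grapheme clusters) — hypothetical vector; the
/// expected output is my reading of GB9b, not official test data.
const MULTI_PREPEND_CASE: (&str, &[&str]) = (
    "a\u{0600}\u{0600}b",
    // GB9b forbids a break after a Prepend character, so both U+0600s
    // glue to the following "b"; "a" remains its own cluster.
    &["a", "\u{0600}\u{0600}b"],
);
```

A basic sanity check on any such vector is that the clusters concatenate back to the input.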
@@ -80,9 +80,9 @@ enum UWordBoundsState {
     Numeric,
     Katakana,
     ExtendNumLet,
-    Regional,
+    Regional(/* half */ bool),
Do you think it would be more readable to split these into separate enum variants, { Regional, RegionalHalf, Zwj, ZwjTainted }?
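For reference, the two shapes under discussion, as I understand them (a sketch, not the actual code; the bool-carrying `Zwj` variant is my guess at the current shape from the names in the suggestion):

```rust
// Current shape (assumed): a payload flag on each state.
#[allow(dead_code)]
enum StateWithFlags {
    Regional(/* half */ bool),
    Zwj(/* tainted */ bool),
}

// Suggested alternative: one flat variant per state, so matches read
// without destructuring booleans.
#[allow(dead_code)]
#[derive(PartialEq, Debug)]
enum StateFlat {
    Regional,
    RegionalHalf,
    Zwj,
    ZwjTainted,
}

// With flat variants a transition check needs no boolean plumbing:
fn is_complete(s: &StateFlat) -> bool {
    matches!(s, StateFlat::Regional | StateFlat::Zwj)
}
```

The flat form trades a couple of extra variants for match arms that name the state directly.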
Added a fix and test for the
Force-pushed from 38aa317 to 8e06fc9.
Well, r=me on your code; I went through it pretty thoroughly when figuring out what was missing.
Hey @Manishearth, @mbrubeck: I know I've been entirely an absentee on this, so first: thank you both for the flurry of work here! I have full faith in both of you and your efforts here, but I don't want to (appear to) entirely shirk my responsibility (though I will admit that I may be well past avoiding that appearance...), so: would I offend either of you if I took a very quick look over this tomorrow before merging? Once again: thank you!!!
Sounds fine by me! Like I said, the official tests are incomplete (especially the word break ones), so it would be good to have more review.
Hi all, I've run this branch against the tests mentioned in #8 and my single ZWJ case is still failing. I would be happy to help write more test cases around RI symbols and emoji ZWJ stuff if that would be helpful.
Ugh, WB4 raises its ugly head again.
Fixed. The spec is a bit unclear on the exact greediness of WB4; in particular, it's not clear how sequences mixing ZWJ with other Format/Extend characters should collapse. In the current implementation WB4 is not greedy and will let WB3c create any no_boundaries it wants first, so complicated Format/Extend sequences containing ZWJ are equivalent to a bare ZWJ in the context of WB3c.
Is it worth checking whether other implementations make the same decision regarding greediness? Who else has this implemented? Perhaps libiconv?
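To make the greediness question concrete, here is a sketch of WB4 as a plain string rewrite, X (Extend | Format | ZWJ)* → X, over a toy classifier. Only a few representative code points are classified; the real implementation consults the generated word-break property tables, and the function names are illustrative.

```rust
/// Toy classifier: combining marks U+0300..=U+036F as Extend, U+00AD
/// (soft hyphen) as Format, U+200D as ZWJ. The real property tables
/// cover far more characters.
fn absorbed_by_wb4(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c) || c == '\u{00AD}' || c == '\u{200D}'
}

/// Apply WB4 greedily as a rewrite: after any base character, drop every
/// following Extend/Format/ZWJ. Note this erases ZWJs entirely, which is
/// exactly why the ordering against WB3c (ZWJ x Glue_After_Zwj / EBG)
/// matters: collapse first and WB3c has nothing left to match.
fn collapse_wb4(s: &str) -> String {
    let mut out = String::new();
    let mut have_base = false;
    for c in s.chars() {
        if have_base && absorbed_by_wb4(c) {
            continue; // absorbed into the preceding base per WB4
        }
        out.push(c);
        have_base = true;
    }
    out
}
```

A sequence-initial Extend is kept, since WB4 needs a preceding X to absorb into; everything after a base collapses, including any ZWJ.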
Force-pushed from e318038 to 8bac7c7.
So... I realized that we also fail https://github.com/cmyr/rust-unicode-segmentation-tests/blob/master/src/lib.rs#L18, because we treat the space as its own word. And the spec says nothing about spaces; spaces are a regular Any character. Logically, it makes sense to drop spaces, but the tests contain these two cases:

which means that spaces still need to join with ZWJ, so the rules aren't that simple. I'm pushing up the tests.
I plan to open a dialogue with the spec authors about:

so I'd prefer to merge this based on the current spec, hash out clarifications, and later fix any issues.
The rules are specified as applying in order, so this is correct: WB3c should match before WB4 is applied.
No, that's the point: it's not clear at all. WB3c does not apply directly to the original string, and WB4 can apply to that string in multiple ways, collapsing it into different intermediate forms. The problem is that WB4 makes this algorithm loop in on itself: suddenly the greediness of WB4 matters, because it affects WB3c in the next iteration after an equivalence has been collapsed.
Okay, the spaces thing isn't an issue -- I confused "Word boundaries" and "extracted words" in http://www.unicode.org/reports/tr29/proposed.html#Word_Boundaries. We probably should expose an API for the latter if we don't already. So the tests as-committed are fine.
(Sent the email, focused on WB4. The spaces thing isn't a problem, and I'll discuss the Indic thing on a different list.)
Question: is this ready to be merged, or are we waiting on answers to questions?
Ready to be merged.
("\u{1f468}\u{200d}\u{1f468}\u{200d}\u{1f466}", &["\u{1f468}\u{200d}\u{1f468}\u{200d}\u{1f466}"]),
("😌👎🏼", &["😌", "👎🏼"]),
// perhaps wrong, spaces should not be included?
("hello world", &["hello", " ", "world"]),
Will the answer to @Manishearth's question to the Unicode folks answer this question? If not, how can I help us resolve this question?
No, this is independent of the spec. The spec tells us that the boundaries in `hello world` are `hello| |world`, and we implement that. However, when we think of words we ignore spaces, so a "word iterator" would give `hello|world`. The question is what API we should expose.
It's probably better to have a notion of word iterator built on top of the regular word boundary iterator. This can be handled separately.
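Such a word iterator can be a thin filter over the boundary iterator. A sketch under that assumption (the segment source here is a stand-in for the real split_word_bounds output, and the filtering criterion is one plausible choice, not settled API):

```rust
/// Keep only segments containing at least one alphanumeric character,
/// dropping pure whitespace/punctuation segments such as " ".
fn words<'a, I>(segments: I) -> impl Iterator<Item = &'a str>
where
    I: Iterator<Item = &'a str>,
{
    segments.filter(|seg| seg.chars().any(char::is_alphanumeric))
}
```

So for the boundary segments of "hello world", namely ["hello", " ", "world"], the filter yields the hello|world view discussed above while the underlying boundary iterator stays spec-faithful.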
@Manishearth just the one question above. Also: thoughts on version numbering? The Rust API has no breaking change, but we might stretch our notion of API to include "Unicode version," in which case it would arguably be reasonable to increment the major number. Finally: thoughts on squashing vs. keeping separate commits? (@mbrubeck please chime in too, of course)
Semver is not just about compile-breaking changes; semantic breaking changes can be counted too. I would do a major version bump. Keeping separate commits is fine IMO.
@mbrubeck @Manishearth @cmyr thanks for your hard work on this 👍
Thanks @Manishearth @kwantam @mbrubeck, I'll play around with this today and see if I can find any more weird ZWJ behaviour.
So my interpretation of WB4 was wrong. We have to apply it in order of precedence. Thus, we cannot transform the input before attempting to apply WB3 (and the previous rules). Only then do we transform, and at that point the previous rules can no longer be applied again.
Clarification: does that mean the current functionality is wrong? If so, I can open an issue.
Yes, it does. Fixing this can get complicated.
Thanks. Opened #11 for this. I'll try to find some time to think about it.
This is a work-in-progress update to Unicode 9.0.0. I'm submitting it here in case someone else wants to work on it, because I'm not sure whether I'll have time to finish it soon.
The grapheme cluster changes are all implemented and the tests are all passing, but there are some new inefficiencies in reverse iteration over grapheme clusters. See the "TODO" comments for details.
The word boundary changes are mostly not implemented yet, so some word boundary tests are failing.