Treatment of surrogate code points #119
-
You said there are no stupid questions. Well, here goes… I'm curious about how bstr handles surrogate code points. If such a byte sequence comes up, it's judged to be invalid UTF-8 per the decoding strategy of Bjoern Hoehrmann (adapted in bstr). I can understand that, though it makes me wonder about possible differences between encoding validity and code point validity. These questions can become relevant when working on, e.g., Unicode normalization or collation. The conformance tests for collation do include surrogate code points and other dubious input, and there's a way of handling it. Anyway, this is a digression.

If we accept that illegal code points are invalid UTF-8, to be replaced with U+FFFD, there is then the question of which replacement approach to follow. I guess the preferred one is by "maximal subpart." Here's where my possibly dumb question comes in. Given that a sequence representing a surrogate code point (e.g., [ED, A0, 81] representing U+D801) follows the rules of UTF-8 in a sense, shouldn't it be replaced with one instance of U+FFFD?

What I found is somewhat different. I'll put example code below. But I don't know enough about UTF-8 validation to tell whether this is desired behavior. Thanks in advance!

use bstr::{ByteSlice, B};
fn main() {
    // Make illegal char from D801 (surrogate).
    // Caution: a surrogate is not a valid `char`, so this is undefined
    // behavior in Rust; it is done here only to produce the test bytes.
    let x = unsafe { char::from_u32_unchecked(55297) };
    let mut s = String::new();
    s.push('a');
    s.push(x);
    s.push('c');
    println!("Chars: {}", s.chars().count()); // 3
    for c in s.chars() {
        println!("{:04X}", c as u32);
        // 0061
        // D801
        // 0063
    }
    let b = B(&s);
    println!("{:X?}", b); // [61, ED, A0, 81, 63]
    println!("Chars: {}", b.chars().count()); // 5
    for c in b.chars() {
        println!("{:04X}", c as u32);
        // 0061
        // FFFD
        // FFFD
        // FFFD
        // 0063
    }
}
-
Quick follow-up: I guess what happens is that ED followed by anything outside of the range 80..=9F immediately puts the automaton in the REJECT state. It would never get to the point of considering a three-byte code point.
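To make that guess concrete, here is a minimal sketch of the second-byte constraints from the Unicode Standard's table of well-formed UTF-8 byte sequences. It is not bstr's actual automaton; the function names `second_byte_range` and `accepts_second_byte` are made up for illustration.

```rust
use std::ops::RangeInclusive;

/// Allowed range for the second byte of a UTF-8 sequence, given its lead byte.
/// Returns None for ASCII bytes, continuation bytes, and bytes that can never
/// start a multi-byte sequence. (Hypothetical helper, not part of bstr.)
fn second_byte_range(lead: u8) -> Option<RangeInclusive<u8>> {
    match lead {
        0xC2..=0xDF => Some(0x80..=0xBF),
        0xE0 => Some(0xA0..=0xBF), // excludes overlong encodings
        0xE1..=0xEC | 0xEE..=0xEF => Some(0x80..=0xBF),
        0xED => Some(0x80..=0x9F), // excludes surrogates U+D800..=U+DFFF
        0xF0 => Some(0x90..=0xBF), // excludes overlong encodings
        0xF1..=0xF3 => Some(0x80..=0xBF),
        0xF4 => Some(0x80..=0x8F), // excludes values above U+10FFFF
        _ => None,
    }
}

/// True if `second` may follow `lead` in well-formed UTF-8.
fn accepts_second_byte(lead: u8, second: u8) -> bool {
    second_byte_range(lead).map_or(false, |range| range.contains(&second))
}

fn main() {
    // ED A0 is rejected at the second byte, so the third byte is never
    // considered as part of the same sequence.
    println!("ED A0: {}", accepts_second_byte(0xED, 0xA0));
    println!("ED 9F: {}", accepts_second_byte(0xED, 0x9F));
    println!("EE A0: {}", accepts_second_byte(0xEE, 0xA0));
}
```

Under this table, ED is the only three-byte lead whose second byte is capped at 9F, which is exactly what keeps the surrogate range out of well-formed UTF-8.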
-
Nice question.

I think this is more of a doc bug/imprecision than anything else. The issue is that I use the word "codepoint" in a lot of places when the more precise term here would be "Unicode scalar value." With that said, the specific part of the docs that describes the "substitution by maximal subparts" behavior is actually precisely correct:

Since surrogate codepoints do not have a corresponding valid UTF-8 code unit sequence, it follows that the doc wording above precisely describes the current behavior here.

The key here is that even though you can follow the "UTF-8 encoding algorithm" to encode surrogate codepoints, the definition of UTF-8 itself excludes them as valid sequences. And any correct UTF-8 decoder will indeed reject them.

If you do need to handle surrogates differently than regular invalid UTF-8 (I am somewhat surprised by that, although I am not familiar with the Unicode Collation Algorithm), then I think it's probably reasonable at that point to write your own UTF-8 decoder. Or somehow build something on top of what

Otherwise, with respect to the docs, I do tend to use "codepoint" instead of the more precisely correct term "Unicode scalar value" because "codepoint" is much less of a mouthful. Also, since
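The distinction between following the encoding algorithm and producing valid UTF-8 can be sketched with the standard library alone. The helper `decode_three_bytes_lenient` below is hypothetical: it applies only the three-byte bit layout and deliberately skips the surrogate and overlong checks that a conforming decoder performs.

```rust
/// Hypothetical lenient decoder: applies the three-byte UTF-8 bit layout
/// (1110xxxx 10xxxxxx 10xxxxxx) without rejecting surrogates or overlongs.
fn decode_three_bytes_lenient(bytes: [u8; 3]) -> Option<u32> {
    let [b0, b1, b2] = bytes;
    let shape_ok =
        (b0 & 0xF0) == 0xE0 && (b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80;
    if !shape_ok {
        return None;
    }
    Some((((b0 & 0x0F) as u32) << 12) | (((b1 & 0x3F) as u32) << 6) | ((b2 & 0x3F) as u32))
}

fn main() {
    let bytes = [0xED, 0xA0, 0x81];

    // The bit layout alone yields the surrogate code point U+D801.
    let code_point = decode_three_bytes_lenient(bytes);
    println!("{:X?}", code_point);

    // A surrogate is a code point but not a Unicode scalar value,
    // so the safe conversion to `char` refuses it.
    println!("{:?}", code_point.and_then(char::from_u32));

    // The standard library's validator rejects the same bytes.
    println!("{}", std::str::from_utf8(&bytes).is_err());
}
```

A decoder that needs to treat surrogates specially could start from a lenient step like this and then decide what to do with values in D800..=DFFF, but that choice is outside what UTF-8 itself defines.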
-
I looked into it more, and bstr does follow the "maximal subpart" approach correctly:
Afaik, this shouldn't cause a problem for normalization (which I saw listed as "maybe in scope" for bstr). If you encounter a sequence containing a surrogate code point, and replace that with repeated U+FFFD, it shouldn't make much difference, since such code points couldn't be affected by decomposition or reordering, anyway. The weird case is collation (which, iirc, is "out of scope" here). The algorithm specification has a few things to say about these problems:
However, out of the box, the conformance tests expect the second of those approaches. e.g., they want the sequence

Anyway, thanks for your help!
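The "maximal subpart" point at the top of this comment can be checked against Rust's standard library, used here as a stand-in for bstr (an assumption; the thread only reports bstr's output for the surrogate case). The helper `first_error_len` is hypothetical and simply exposes `Utf8Error::error_len` for the first invalid sequence.

```rust
/// Length in bytes of the first invalid sequence, as reported by the standard
/// library. Returns None if the input is valid, or if the input ends in the
/// middle of an otherwise acceptable sequence.
fn first_error_len(bytes: &[u8]) -> Option<usize> {
    match std::str::from_utf8(bytes) {
        Ok(_) => None,
        Err(err) => err.error_len(),
    }
}

fn main() {
    // E4 B8 is a valid prefix of a three-byte sequence, cut short by 'a':
    // the maximal subpart is two bytes long and becomes a single U+FFFD.
    let truncated = [0xE4, 0xB8, 0x61];
    println!("{:?}", first_error_len(&truncated));
    println!("{:?}", String::from_utf8_lossy(&truncated));

    // ED A0 is not a valid prefix of anything, so the maximal subpart is ED
    // alone; A0 and 81 are then stray continuation bytes.
    let surrogate = [0xED, 0xA0, 0x81];
    println!("{:?}", first_error_len(&surrogate));
    println!("{:?}", String::from_utf8_lossy(&surrogate));
}
```

The contrast shows why the surrogate sequence yields three replacement characters while a genuinely truncated sequence yields one.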
-
Yeah, that's totally possible, and bstr-compatible. Take bstr chars, which come with validation; normalize them to form NFD (which doesn't need to be difficult, though perf can be a challenge); then put that through the UCA. Invalid sequences will have been converted to U+FFFD, which is assigned a weight (a very high one) in the collation element table. Any UCA implementation can deal with U+FFFD. The questionable part is calculating different weights for different surrogate code points. Set that goal aside, and there's no problem.
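As a rough sketch of the first step of that pipeline, the snippet below uses the standard library's `String::from_utf8_lossy` in place of bstr's `chars()` (an assumption made so the example has no dependencies). The NFD and UCA steps are left out because they need external crates, and `decode_lossy` is a hypothetical name.

```rust
/// Decode bytes into Unicode scalar values, replacing each invalid
/// sequence with U+FFFD. Stand-in for iterating over bstr chars.
fn decode_lossy(bytes: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(bytes).chars().collect()
}

fn main() {
    // The bytes from the original question: 'a', an encoded surrogate, 'c'.
    let scalars = decode_lossy(&[0x61, 0xED, 0xA0, 0x81, 0x63]);
    for c in &scalars {
        println!("{:04X}", *c as u32);
    }
    // Every element is now a valid scalar value, so a later normalization
    // or collation step never sees a surrogate.
    println!("Chars: {}", scalars.len());
}
```

Whatever comes after this step only has to handle U+FFFD, which is the point made above about the collation element table.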