Treatment of surrogate code points #119

theodore-s-beers · 2022-07-12T10:09:54Z

theodore-s-beers
Jul 12, 2022

You said there are no stupid questions. Well, here goes…

I'm curious about how bstr handles surrogate code points. If such a byte sequence comes up, it's judged to be invalid UTF-8 per the decoding strategy of Bjoern Hoehrmann (adapted in bstr). I can understand that… though it makes me wonder about possible differences between encoding validity and code point validity. These questions can become relevant when working on, e.g., Unicode normalization or collation. The conformance tests for collation do include surrogate code points and other dubious input, and there's a way of handling it. Anyway, this is a digression.

If we accept that illegal code points are invalid UTF-8, to be replaced with U+FFFD, there is then the question of which replacement approach to follow. I guess the preferred one is by "maximal subpart." Here's where my possibly dumb question comes in. Given that a sequence representing a surrogate code point—e.g., [ED, A0, 81] representing U+D801—follows the rules of UTF-8 in a sense, shouldn't it be replaced with one instance of U+FFFD? What I found is somewhat different. I'll put example code below. But I don't know enough about UTF-8 validation to tell whether this is desired behavior.

Thanks in advance!

use bstr::{ByteSlice, B};

fn main() {
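    // Make an illegal char from U+D801 (a surrogate).
    // Note: constructing a surrogate `char` is undefined behavior; this is for demonstration only.
    let x = unsafe { char::from_u32_unchecked(0xD801) };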

    let mut s = String::new();

    s.push('a');
    s.push(x);
    s.push('c');

    println!("Chars: {}", s.chars().count()); // 3

    for c in s.chars() {
        println!("{:04X}", c as u32);
        // 0061
        // D801
        // 0063
    }

    let b = B(&s);

    println!("{:X?}", b); // [61, ED, A0, 81, 63]

    println!("Chars: {}", b.chars().count()); // 5

    for c in b.chars() {
        println!("{:04X}", c as u32);
        // 0061
        // FFFD
        // FFFD
        // FFFD
        // 0063
    }
}

Answered by BurntSushi

theodore-s-beers · 2022-07-12T11:59:04Z

theodore-s-beers
Jul 12, 2022
Author

Quick follow-up: I guess what happens is that ED followed by anything outside of the range 80..=9F immediately puts the automaton in the REJECT state. It would never get to the point of considering a three-byte code point.
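The rejection point can also be observed without bstr. The sketch below uses only the standard library, on the assumption that std's validator makes the same decision at this byte (its `Utf8Error` API reports where validation stopped). `first_error` is a hypothetical helper name, not part of bstr or std.

```rust
use std::str;

/// Returns (valid_up_to, error_len) for the first UTF-8 error, if any.
/// Illustrative helper only; not part of bstr or std.
fn first_error(bytes: &[u8]) -> Option<(usize, Option<usize>)> {
    str::from_utf8(bytes)
        .err()
        .map(|e| (e.valid_up_to(), e.error_len()))
}

fn main() {
    // 'a', then ED A0 81 (the would-be encoding of U+D801), then 'c'.
    let bytes = [0x61, 0xED, 0xA0, 0x81, 0x63];

    // Validation stops after 'a', and the invalid run is reported as a
    // single byte, because ED cannot be followed by A0.
    println!("{:?}", first_error(&bytes)); // Some((1, Some(1)))

    // ED followed by a byte in 80..=9F is fine: ED 9F BF encodes U+D7FF.
    println!("{:?}", first_error(&[0xED, 0x9F, 0xBF])); // None
}
```

An `error_len` of 1 means the decoder gave up on ED alone, so A0 and 81 are each examined afresh and rejected as stray continuation bytes.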

0 replies

BurntSushi · 2022-07-12T12:11:31Z

BurntSushi
Jul 12, 2022
Maintainer

Nice question.

I think this is more of a doc bug/imprecision than anything else. The issue is that I use the word "codepoint" in a lot of places when the more precise term here would be "Unicode scalar value." With that said, the specific part of the docs that describes the "substitution by maximal subparts" behavior is actually precisely correct:

In this strategy, a replacement codepoint is inserted whenever a byte is found that cannot possibly lead to a valid UTF-8 code unit sequence. If there were previous bytes that represented a prefix of a well-formed UTF-8 code unit sequence, then all of those bytes (up to 3) are substituted with a single replacement codepoint.

Since surrogate codepoints do not have a corresponding valid UTF-8 code unit sequence, it follows that the doc wording above precisely describes the current behavior here.

The key here is that even though you can follow the "UTF-8 encoding algorithm" to encode surrogate codepoints, the definition of UTF-8 itself excludes them as valid sequences. And any correct UTF-8 decoder will indeed reject them.
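A minimal sketch of that distinction, using only the standard library. `encode_three_byte` is a hypothetical helper that applies the three-byte bit pattern mechanically, with no validity check; it is not how any real encoder should behave.

```rust
/// Mechanically applies the three-byte UTF-8 bit pattern to `cp`.
/// Hypothetical helper: it deliberately skips the surrogate check and
/// assumes `cp` is in the range 0x0800..=0xFFFF.
fn encode_three_byte(cp: u32) -> [u8; 3] {
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

fn main() {
    // U+D801 is a code point but not a Unicode scalar value, so it is not a `char`.
    assert!(char::from_u32(0xD801).is_none());
    // The neighbors of the surrogate range are scalar values.
    assert!(char::from_u32(0xD7FF).is_some());
    assert!(char::from_u32(0xE000).is_some());

    // The encoding algorithm "works" on a surrogate and yields ED A0 81 ...
    let bytes = encode_three_byte(0xD801);
    assert_eq!(bytes, [0xED, 0xA0, 0x81]);
    // ... but the result is not UTF-8, and a conforming decoder rejects it.
    assert!(std::str::from_utf8(&bytes).is_err());

    // The same pattern applied to a scalar value (U+20AC) is accepted.
    assert!(std::str::from_utf8(&encode_three_byte(0x20AC)).is_ok());
}
```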

If you do need to handle surrogates differently than regular invalid UTF-8 (I am somewhat surprised by that, although I am not familiar with the Unicode Collation Algorithm), then I think it's probably reasonable at that point to write your own UTF-8 decoder. Or somehow build something on top of what bstr provides.

Otherwise, with respect to the docs, I do tend to use "codepoint" instead of the more precisely correct term "Unicode scalar value" because "codepoint" is much less of a mouthful. Also, since bstr only cares about UTF-8, I think the context sort of implies that "codepoint" refers to "Unicode scalar value" rather than literally every codepoint. But it might not be a bad idea to at least explicitly call that out somewhere in the docs.

0 replies

theodore-s-beers · 2022-07-12T13:35:14Z

theodore-s-beers
Jul 12, 2022
Author

I looked into it more, and bstr does follow the "maximal subpart" approach correctly:

Also, every byte of a sequence that would correspond to a surrogate code point, or of a truncated version thereof, is replaced with one U+FFFD, as shown in Table 3-9. (The interpretation of such byte sequences has been forbidden since Unicode 3.2.)

Afaik, this shouldn't cause a problem for normalization (which I saw listed as "maybe in scope" for bstr). If you encounter a sequence containing a surrogate code point, and replace that with repeated U+FFFD, it shouldn't make much difference, since such code points couldn't be affected by decomposition or reordering, anyway.
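To make the "maximal subpart" distinction concrete, here is a small sketch using std's `String::from_utf8_lossy` rather than bstr (an assumption: as far as I can tell, std's lossy decoder follows the same substitution practice). `replacements` is a hypothetical helper that counts the U+FFFD characters in the output.

```rust
/// Counts U+FFFD in the lossy decoding of `bytes`.
/// Illustrative helper; assumes the input contains no literal U+FFFD.
fn replacements(bytes: &[u8]) -> usize {
    String::from_utf8_lossy(bytes)
        .chars()
        .filter(|&c| c == '\u{FFFD}')
        .count()
}

fn main() {
    // E2 82 is a prefix of a well-formed sequence (E2 82 AC is U+20AC), so the
    // truncated pair is one maximal subpart and becomes a single U+FFFD.
    println!("{}", replacements(&[0xE2, 0x82, 0x61])); // 1

    // ED A0 is not a prefix of any well-formed sequence, so each byte is
    // replaced on its own.
    println!("{}", replacements(&[0xED, 0xA0, 0x61])); // 2
    println!("{}", replacements(&[0xED, 0xA0, 0x81])); // 3
}
```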

The weird case is collation (which, iirc, is "out of scope" here). The algorithm specification has a few things to say about these problems:

Unicode strings sometimes contain ill-formed code unit sequences. Such ill-formed sequences must not be interpreted as valid Unicode characters. See Section 3.2, Conformance Requirements in [Unicode]. For example, expressed in UTF-32, a Unicode string might contain a 32-bit value corresponding to a surrogate code point… Implementations of the Unicode Collation Algorithm may choose to treat such ill-formed code unit sequences as error conditions and respond appropriately, such as by throwing an exception.

An implementation of the Unicode Collation Algorithm may also choose not to treat ill-formed sequences as an error condition, but instead to give them explicit weights. This strategy provides for determinant comparison results for Unicode strings, even when they contain ill-formed sequences. However, to avoid security issues when using this strategy, ill-formed code sequences should not be given an ignorable or variable primary weight.

There are two recommended approaches, based on how these ill-formed sequences are typically handled by character set converters. The first approach is to weight each maximal ill-formed subsequence as if it were U+FFFD REPLACEMENT CHARACTER. (For more information about maximal ill-formed subsequences, see Section 3.9, Unicode Encoding Forms in [Unicode].) A second approach is to generate an implicit weight for any surrogate code point as if it were an unassigned code point, using the method of Section 10.1.3, Implicit Weights.

However, out of the box, the conformance tests expect the second of those approaches. For example, they want the sequence D801 0021 to sort greater than or equal to D800 0062, which is possible only if a higher weight has been generated for D801. I got that to work in my basic implementation, but it's hacky and makes various optimizations (which might be facilitated by bstr) more difficult. The test docs do say that problematic lines can be filtered out for implementations that refuse to calculate weights for surrogate code points. Maybe that's a better approach; I just don't like the fact that it's seemingly non-default.
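As a rough sketch of why the second approach separates D800 from D801, here is the implicit-weight computation as I understand it from UTS #10, Section 10.1.3, using the base value FBC0 for unassigned code points. `implicit_primaries` is a hypothetical function name, and a real implementation would also emit secondary and tertiary weights.

```rust
/// Implicit primary weights (AAAA, BBBB) for a code point treated as
/// unassigned, per the formula in UTS #10, Section 10.1.3.
/// The base 0xFBC0 is the one given there for unassigned code points.
fn implicit_primaries(cp: u32) -> (u32, u32) {
    let aaaa = 0xFBC0 + (cp >> 15);
    let bbbb = (cp & 0x7FFF) | 0x8000;
    (aaaa, bbbb)
}

fn main() {
    let d800 = implicit_primaries(0xD800);
    let d801 = implicit_primaries(0xD801);

    println!("{:04X?}", d800); // (FBC1, D800)
    println!("{:04X?}", d801); // (FBC1, D801)

    // The two surrogates already differ at the primary level, so D801 sorts
    // after D800 no matter which characters follow.
    assert!(d801 > d800);
}
```

Under the first approach, both surrogates would instead receive the weight of U+FFFD and tie, leaving the order to be decided by the characters that follow (0021 versus 0062).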

Anyway, thanks for your help!

1 reply

BurntSushi Jul 12, 2022
Maintainer

w.r.t. normalization, yes, I'd probably be willing to bring in an implementation of it that works on conventional UTF-8. For UCA though, my perception is that its complexity is so great that I probably just don't have the bandwidth to maintain something like that inside of bstr.

"The test docs do say that problematic lines can be filtered out for implementations that refuse to calculate weights for surrogate code points. Maybe that's a better approach; I just don't like the fact that it's seemingly non-default."

My bias is to make such things work on any input. That is why bstr exists in the first place. There's nothing worse than loading some data into your program and having it crap out because there was one invalid byte somewhere. Then you're stuck and can't do anything. Obviously, this is a trade-off: by not crapping out on an invalid byte, you might be contributing to an overall worse failure mode; e.g., maybe the end user wants to know about invalid data and have a chance to fix it.

Like I said, I'm not terribly familiar with UCA, but I would be willing to bet that it would be quite useful to make it work on data that is only conventionally UTF-8. So not just surrogate codepoints, but also any arbitrarily invalid UTF-8. I'm thinking of programs like sort, for example, where I would be really frustrated if it crapped out in the middle of sorting a large file because it wasn't purely valid UTF-8. Instead, I'd probably hope that it would "just do something reasonable."

Anyway, that's my uninformed 2 cents.

theodore-s-beers · 2022-07-12T16:07:28Z

theodore-s-beers
Jul 12, 2022
Author

"Like I said, I'm not terribly familiar with UCA, but I would be willing to bet that it would be quite useful to make it work on data that is only conventionally UTF-8. So not just surrogate codepoints, but also any arbitrarily invalid UTF-8."

Yeah, that's totally possible, and bstr-compatible. Take bstr chars, which come with validation; normalize them to form NFD (which doesn't need to be difficult, though perf can be a challenge); then put that through the UCA. Invalid sequences will have been converted to U+FFFD, which is assigned a weight (a very high one) in the collation element table. Any UCA implementation can deal with U+FFFD. The questionable part is calculating different weights for different surrogate code points. Set that goal aside, and there's no problem.

3 replies

BurntSushi Jul 12, 2022
Maintainer

I see. In that case, can you take the first approach? Where any surrogate is just given the same weight as U+FFFD?

theodore-s-beers Jul 14, 2022
Author

Yeah. I've switched my UCA implementation (still in an early stage of development, and maybe more of a learning project for me than something destined for serious use) to the "treat all surrogate code points as U+FFFD" approach. It just makes more sense to accept &str or &[u8], validate input (incl. fixing things like surrogates), produce a list of scalar values, and get on with collation.
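A minimal sketch of that front end, assuming std's lossy decoder stands in for bstr's `chars()` and leaving out normalization and collation entirely. `scalar_values` is a hypothetical name, not taken from any existing implementation.

```rust
/// Decodes `input` into Unicode scalar values, replacing ill-formed
/// sequences (including would-be surrogates) with U+FFFD.
/// Hypothetical helper; accepts &str, &[u8], and similar types.
fn scalar_values(input: impl AsRef<[u8]>) -> Vec<u32> {
    String::from_utf8_lossy(input.as_ref())
        .chars()
        .map(u32::from)
        .collect()
}

fn main() {
    // A &str is already valid and passes through unchanged.
    println!("{:X?}", scalar_values("ac")); // [61, 63]

    // Raw bytes containing ED A0 81 come out as three U+FFFD.
    let bytes: &[u8] = b"a\xED\xA0\x81c";
    println!("{:X?}", scalar_values(bytes)); // [61, FFFD, FFFD, FFFD, 63]
}
```

Everything downstream of this function sees only scalar values, so the collation code never has to special-case surrogates.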

BurntSushi Jul 14, 2022
Maintainer

Absent some other practical concern I don't know about, I agree. :-)
