-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
printf: improve support of printing multi-byte values of characters #7048
Conversation
95a1333
to
80988c3
Compare
GNU testsuite comparison:
|
GNU testsuite comparison:
|
0e9e153
to
63eb50f
Compare
GNU testsuite comparison:
|
src/uu/printf/src/printf.rs
Outdated
#[cfg(windows)] | ||
let format_vec: Vec<u8> = format | ||
.encode_wide() | ||
.flat_map(|wchar| wchar.to_le_bytes()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't do what we want, because it's effectively casting UTF-16 into a byte array, and UTF-16 is not byte-compatible with UTF-8 (e.g., ASCII in UTF-8 is just the literal ASCII byte per char, while ASCII in UTF-16 will be two bytes each). Because invalid unicode is much rarer in Windows, the way to turn OsStr(ing)s into bytes for us is using os_str_as_bytes
, which will error on invalid unicode on Windows, or os_str_as_bytes_lossy
, which will turn invalid unicode sequences into the replacement character on Windows. Windows makes it difficult to pass invalid unicode as an argument, so whichever feels more appropriate should be fine here.
Also, a nit that won't be relevant after the change, but cfgs assuming unix or windows is a code smell, since there are platforms that are neither (none that we support yet, but I hope to change that at some point 😛).
src/uu/printf/src/printf.rs
Outdated
.collect(); | ||
FormatArgument::Unparsed( | ||
String::from_utf8(raw_bytes.clone()) | ||
.unwrap_or_else(|_| raw_bytes.iter().map(|&b| b as char).collect()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find it documented anywhere, but collecting b as char
has unpredictable behavior. Casting a u8
to char
casts to the char with the code point with that value, not the UTF-8 parsing of that value. This was made invisible because somewhere, Rust seems to be doing some kind of dynamic dispatch to instead collect the bytes into a UTF-8 string if the whole iterator is valid UTF-8. Here's a playground if you want a more detailed demonstration.
Ideally, the FormatArgument::Unparsed
value should contain bytes or an OsString instead of a String, but assuming that's outside the scope of this PR, it's better to specify what should happen if a non-UTF-8 string is passed in, using to_string
or to_string_lossy
, avoiding the encoding of values
as bytes entirely for now (this also avoids allocation overhead, since char
is the size of a u32
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtracey would you like to try to finish this PR ? :)
GNU testsuite comparison:
|
GNU testsuite comparison:
|
Sorry for the delayed response. Your latest commit close but not quite right (collecting bytes has the problem I mentioned before, even if the compiler isn't complaining). Hopefully my version #7208 works for you? |
excellent, many thanks :) |
GNU testsuite comparison:
|
No description provided.