Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

printf: improve support of printing multi-byte values of characters #7048

Closed
wants to merge 7 commits into from

Conversation

sylvestre
Copy link
Contributor

No description provided.

@sylvestre sylvestre force-pushed the printf-go branch 7 times, most recently from 95a1333 to 80988c3 Compare January 1, 2025 17:46
Copy link

github-actions bot commented Jan 1, 2025

GNU testsuite comparison:

Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Copy link

github-actions bot commented Jan 1, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

@sylvestre sylvestre force-pushed the printf-go branch 3 times, most recently from 0e9e153 to 63eb50f Compare January 1, 2025 18:35
Copy link

github-actions bot commented Jan 1, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

@sylvestre
Copy link
Contributor Author

@jtracey given your work on #7020 i am wondering if you would be able to help with the windows support ? :) thanks

#[cfg(windows)]
let format_vec: Vec<u8> = format
.encode_wide()
.flat_map(|wchar| wchar.to_le_bytes())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't do what we want, because it's effectively casting UTF-16 into a byte array, and UTF-16 is not byte-compatible with UTF-8 (e.g., ASCII in UTF-8 is just the literal ASCII byte per char, while ASCII in UTF-16 will be two bytes each). Because invalid unicode is much rarer in Windows, the way to turn OsStr(ing)s into bytes for us is using os_str_as_bytes, which will error on invalid unicode on Windows, or os_str_as_bytes_lossy, which will turn invalid unicode sequences into the replacement character on Windows. Windows makes it difficult to pass invalid unicode as an argument, so whichever feels more appropriate should be fine here.

Also, a nit that won't be relevant after the change, but cfgs assuming unix or windows is a code smell, since there are platforms that are neither (none that we support yet, but I hope to change that at some point 😛).

.collect();
FormatArgument::Unparsed(
String::from_utf8(raw_bytes.clone())
.unwrap_or_else(|_| raw_bytes.iter().map(|&b| b as char).collect()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find it documented anywhere, but collecting b as char has unpredictable behavior. Casting a u8 to char casts to the char with the code point with that value, not the UTF-8 parsing of that value. This was made invisible because somewhere, Rust seems to be doing some kind of dynamic dispatch to instead collect the bytes into a UTF-8 string if the whole iterator is valid UTF-8. Here's a playground if you want a more detailed demonstration.

Ideally, the FormatArgument::Unparsed value should contain bytes or an OsString instead of a String, but assuming that's outside the scope of this PR, it's better to specify what should happen if a non-UTF-8 string is passed in, using to_string or to_string_lossy, avoiding the encoding of values as bytes entirely for now (this also avoids allocation overhead, since char is the size of a u32).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtracey would you like to try to finish this PR ? :)

Copy link

github-actions bot commented Jan 6, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/usage_vs_getopt (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

@jtracey
Copy link
Contributor

jtracey commented Jan 24, 2025

Sorry for the delayed response. Your latest commit close but not quite right (collecting bytes has the problem I mentioned before, even if the compiler isn't complaining). Hopefully my version #7208 works for you?

@sylvestre
Copy link
Contributor Author

excellent, many thanks :)

@sylvestre sylvestre closed this Jan 24, 2025
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants