[stdlib] Fix wrong Char posix whitespace designation #3983

Open
wants to merge 8 commits into
base: main

martinvuyk
Contributor

POSIX considers only " \t\n\v\f\r" to be whitespace, whereas the implementation also takes other ASCII whitespace into account.

@JoeLoser JoeLoser requested a review from ConnorGray February 11, 2025 16:15
@JoeLoser
Collaborator

FYI @ConnorGray as you're actively working in this space

@JoeLoser
Collaborator

@martinvuyk do you mind rebasing/resolving conflicts and then we can sync this? Thanks!

Contributor Author

@martinvuyk martinvuyk left a comment

@JoeLoser done :)

@@ -327,7 +327,7 @@ fn _trim_and_handle_sign(str_slice: StringSlice, str_len: Int) -> (Int, Bool):
"""
var buff = str_slice.unsafe_ptr()
var start: Int = 0
- while start < str_len and Codepoint(buff[start]).is_posix_space():
+ while start < str_len and Codepoint(buff[start]).is_ascii_space():
Contributor Author

FYI this is much less efficient than calling the original _isspace() directly (which is what this method's implementation uses), since _isspace() does not convert to UTF-32 on each iteration (it only needs to compare a byte). Branch prediction might lessen the effect at runtime, but some penalty remains in the instruction fetch pipeline. It's one of the motivations behind #3988.
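To illustrate the byte-comparison point, here is a minimal sketch in Python rather than Mojo. The names `POSIX_SPACE` and `skip_leading_space` are hypothetical and are not the stdlib's `_isspace()`. The sketch only shows that leading POSIX whitespace can be skipped by comparing raw bytes, with no decoding step per iteration.

```python
# POSIX "C" locale whitespace as raw byte values: space, \t, \n, \v, \f, \r.
POSIX_SPACE = frozenset(b" \t\n\v\f\r")


def skip_leading_space(data: bytes) -> int:
    """Return the index of the first byte that is not POSIX whitespace."""
    start = 0
    # Each iteration is a single byte-membership check; nothing is decoded.
    while start < len(data) and data[start] in POSIX_SPACE:
        start += 1
    return start


print(skip_leading_space(b" \t\n42"))
```

Under this definition a leading 0x1C byte is not skipped, which is the behavior change the PR title describes.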

@JoeLoser
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Feb 24, 2025
@ConnorGray
Collaborator

ConnorGray commented Feb 26, 2025

Thanks for tackling this Martin. Looking deeper into this, I have some thoughts.

First, I notice our current is_posix_space() function considers 0x1C (File Separator), 0x1D (Group Separator), and 0x1E (Record Separator) as whitespace characters. That was unexpected to me, and I see you're fixing that. 👍 to that 🙂.

Looking at definitions of whitespace from other systems:

However, I do notice that Python's str.isspace() method does consider record separators as whitespace:

Return True if there are only whitespace characters in the string and there is at least one character, False otherwise.

A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs (“Separator, space”), or its bidirectional class is one of WS, B, or S.

Which perhaps is why our current code also includes record separators as whitespace, to match Python's idea of what ASCII characters are whitespace.
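Python's behavior here can be checked directly. The following sketch uses only built-ins, and the variable names are illustrative. It lists the ASCII code points that `str.isspace()` accepts beyond the POSIX set quoted in the PR description.

```python
# Collect every ASCII code point that Python's str.isspace() accepts.
python_ascii_space = [cp for cp in range(128) if chr(cp).isspace()]

# The POSIX set from the PR description: " \t\n\v\f\r".
posix_space = sorted(ord(ch) for ch in " \t\n\v\f\r")

# Code points Python accepts beyond the POSIX set.
extra = [cp for cp in python_ascii_space if cp not in posix_space]
print([hex(cp) for cp in extra])
```

On CPython 3 the extras are 0x1c through 0x1f, so Python also counts the Unit Separator (0x1F) as whitespace, in addition to the three separators named above.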

As a further nuance, 0x0B (\v) Vertical Tab is considered whitespace by std::isspace, the POSIX locale, and has the White_Space Unicode property, but is not considered ASCII whitespace by Rust's char::is_ascii_whitespace(). (As an aside, this means that Rust's char::is_ascii_whitespace() is not simply char::is_whitespace() restricted to ASCII.)
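The vertical-tab discrepancy can be written down as a set comparison. This is a sketch in Python with hypothetical constant names. The Rust set is transcribed from the documentation of `char::is_ascii_whitespace()` and is not computed by calling Rust.

```python
# POSIX "C" locale whitespace.
POSIX = frozenset(" \t\n\v\f\r")

# Rust's char::is_ascii_whitespace(), per its documentation:
# space, horizontal tab, line feed, form feed, carriage return.
RUST_ASCII = frozenset(" \t\n\f\r")

# The only character POSIX accepts that Rust's ASCII check rejects.
print(sorted(POSIX - RUST_ASCII))

# Python's str.isspace() sides with POSIX on vertical tab.
print("\v".isspace())
```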

Given all that, I think that the concept of "ASCII whitespace" is not well defined, but there are two things that are well-defined:

  • POSIX whitespace
  • Python whitespace

and my thoughts re. the best course of action are:

  1. Fix our is_posix_space() function (as you do here) to not include separator characters
  2. I'm unsure of the value in adding an is_ascii_space() with unclear semantics.
  3. Either continue using is_posix_space(), or commit to using is_python_space() in string processing functions like atol().
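To make the third option concrete, the sketch below compares two Python built-ins that happen to follow the two candidate definitions. With no arguments, `str.strip()` removes whatever `str.isspace()` accepts, while `bytes.strip()` removes only the six characters of the POSIX set. It illustrates the observable difference only and is not a proposal for how `atol()` should be implemented.

```python
raw = "\x1c 42\n"

# str.strip() follows str.isspace(), so the leading File Separator (0x1C)
# is removed together with the space and the newline.
print(repr(raw.strip()))

# bytes.strip() removes only b" \t\n\r\x0b\f", so 0x1C is not whitespace:
# stripping stops at it on the left and only the trailing newline goes.
print(repr(raw.encode("ascii").strip()))
```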

What do you think about that? I've dug into some of the nuance here, but this is a complicated space and I'm sure there's pieces I've missed, so would love to hear your thoughts 🙂

@martinvuyk
Contributor Author

Yes when I implemented the isspace function it was !fun to have some very deep dives into other languages and some very obscure docs. No one seems to document these things properly (I had a tiny meltdown in #3843 when I saw the character set I wrote deleted).

My guess was that \x1c, \x1d, \x1e are legacy characters that aren't used nowadays. But maybe in the old days of the wild 7-bit char and 36-bit word sizes, there were many such characters still in use. \r\n in Windows is still a legacy from the typewriter era...

And then you made me google it XD. Here is an excerpt from this stackoverflow question:

28 – FS – File separator
The file separator FS is an interesting control code, as it gives us insight in the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.

29 – GS – Group separator
Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group.

30 – RS – Record separator
Within a group (or table) the records are separated with RS or record separator.

31 – US – Unit separator
The smallest data items to be stored in a database are called units in the ASCII definition. We would call them field now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited—as in the sixties—this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...

So yes, very much legacy. But they are separators in the ASCII standard, just not well documented anywhere.

As a further nuance, 0x0B (\v) Vertical Tab is considered whitespace by std::isspace, the POSIX locale, and has the White_Space Unicode property, but is not considered ASCII whitespace by Rust's char::is_ascii_whitespace(). (As an aside, this means that Rust's char::is_ascii_whitespace() is not simply char::is_whitespace() restricted to ASCII.)

While implementing our whitespace-related code I came across many such inconsistencies in other languages/libc implementations...

Either continue using is_posix_space(), or commit to using is_python_space() in string processing functions like atol().

We should IMO absolutely commit to being Python-compatible in things like this, otherwise code won't be able to be migrated. This is yet another argument for having String parametrized: if we know that string.encoding == Encoding.ASCII, then we can ignore Unicode whitespace characters and use only the ASCII ones. Even though \r\n will still be a headache, it will mostly speed things up.

We could also make supporting "legacy_ascii" a parameter, False by default, for all functions that use .isspace(), so that when the niche use case of reading digitized magnetic tape or punch cards does arise, they can be processed just the same.
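A rough sketch of what such a switch could look like, written in Python for illustration. Only the name `legacy_ascii` comes from the suggestion above. The function `is_space`, both constants, and the choice to cover exactly \x1c, \x1d and \x1e are assumptions, not an existing API.

```python
# POSIX whitespace as byte values: space, \t, \n, \v, \f, \r.
POSIX_SPACE = frozenset(b" \t\n\v\f\r")

# The legacy separators discussed in this thread: FS, GS, RS.
# (Assumption: 0x1F, Unit Separator, could be added here as well.)
LEGACY_SEPARATORS = frozenset(b"\x1c\x1d\x1e")


def is_space(byte: int, legacy_ascii: bool = False) -> bool:
    """Return True if `byte` counts as whitespace under the chosen mode."""
    if byte in POSIX_SPACE:
        return True
    # Legacy separators only count when the caller opts in.
    return legacy_ascii and byte in LEGACY_SEPARATORS


print(is_space(0x1C), is_space(0x1C, legacy_ascii=True))
```

With the default, callers get POSIX behavior; opting in restores the separator handling for the legacy-data case.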

Or we could just drop them.

WDYT?
