[stdlib] Fix wrong `Char` posix whitespace designation #3983

martinvuyk · 2025-02-01T18:31:49Z

POSIX considers only " \t\n\v\f\r" to be whitespace, whereas the current implementation also counts other ASCII characters as whitespace.
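For reference, the POSIX set can be written down directly. This is a minimal sketch in Python, not the stdlib code touched by this PR; the names `POSIX_WHITESPACE` and `is_posix_space` are illustrative only.

```python
# Whitespace in the POSIX ("C") locale: space, \t, \n, \v, \f, \r.
# Iterating over a bytes literal yields ints, so this is a set of byte values.
POSIX_WHITESPACE = frozenset(b" \t\n\v\f\r")


def is_posix_space(byte: int) -> bool:
    """Return True if `byte` is whitespace in the POSIX locale."""
    return byte in POSIX_WHITESPACE


# The ASCII separators 0x1C-0x1F are control characters, not POSIX whitespace,
# so this list stays empty.
separators_accepted = [b for b in (0x1C, 0x1D, 0x1E, 0x1F) if is_posix_space(b)]
```

Under this definition the vertical tab (0x0B) counts as whitespace, while the file, group, record, and unit separators do not.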

Signed-off-by: martinvuyk <[email protected]>

JoeLoser · 2025-02-19T20:25:28Z

FYI @ConnorGray as you're actively working in this space

JoeLoser · 2025-02-21T15:47:03Z

@martinvuyk do you mind rebasing/resolving conflicts and then we can sync this? Thanks!

martinvuyk

@JoeLoser done :)

martinvuyk · 2025-02-21T17:02:23Z

stdlib/src/collections/string/string.mojo

@@ -327,7 +327,7 @@ fn _trim_and_handle_sign(str_slice: StringSlice, str_len: Int) -> (Int, Bool):
    """
    var buff = str_slice.unsafe_ptr()
    var start: Int = 0
-    while start < str_len and Codepoint(buff[start]).is_posix_space():
+    while start < str_len and Codepoint(buff[start]).is_ascii_space():


FYI this is much less efficient than using the original _isspace() directly (which is what this method's implementation uses), since _isspace() does not convert to UTF-32 on each iteration (it just needs to compare a byte). Branch prediction might lessen the effect at runtime, but some penalty remains in the instruction-fetching pipeline. It's one of the motivations behind #3988
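The byte-wise approach described in this comment could look roughly like the sketch below. It is written in Python for illustration and is not the Mojo implementation; `skip_leading_space` and `POSIX_WHITESPACE` are hypothetical names.

```python
POSIX_WHITESPACE = frozenset(b" \t\n\v\f\r")


def skip_leading_space(buf: bytes) -> int:
    """Return the index of the first byte that is not POSIX whitespace."""
    start = 0
    # Each step is a single byte membership test; nothing is decoded
    # into a code point along the way.
    while start < len(buf) and buf[start] in POSIX_WHITESPACE:
        start += 1
    return start
```

For example, `skip_leading_space(b"  \t-42")` stops at the sign character, and a leading 0x1C byte is not skipped.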

JoeLoser · 2025-02-24T18:17:23Z

!sync

ConnorGray · 2025-02-26T20:27:38Z

Thanks for tackling this Martin. Looking deeper into this, I have some thoughts.

First, I notice our current is_posix_space() function considers 0x1C (File Separator), 0x1D (Group Separator), and 0x1E (Record Separator) as whitespace characters. That was unexpected to me, and I see you're fixing that. 👍 to that 🙂.

Looking at definitions of whitespace from other systems:

The POSIX locale doesn't include those characters as whitespace.
C++'s std::isspace does not consider those as whitespace.
Rust's char::is_ascii_whitespace() does not consider those as whitespace.
Those characters do not have the Unicode White_Space character property (as tested by this Rust code using char::is_whitespace()).

However, I do notice that Python's str.isspace() method does consider record separators as whitespace:

Return True if there are only whitespace characters in the string and there is at least one character, False otherwise.

A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs (“Separator, space”), or its bidirectional class is one of WS, B, or S.

Which perhaps is why our current code also includes record separators as whitespace, to match Python's idea of what ASCII characters are whitespace.
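The Python behavior can be checked directly. The sketch below lists every ASCII code point that `str.isspace()` accepts, together with the bidirectional class of each separator; the variable names are illustrative.

```python
import unicodedata

# Every ASCII code point for which str.isspace() returns True.
python_ascii_space = [cp for cp in range(128) if chr(cp).isspace()]

# Bidirectional class of each ASCII separator. Per the documentation quoted
# above, classes B and S (and WS) make a character whitespace.
separator_bidi = {
    hex(cp): unicodedata.bidirectional(chr(cp)) for cp in range(0x1C, 0x20)
}

# bytes.isspace() uses a narrower, ASCII-only definition.
bytes_accepts_fs = b"\x1c".isspace()
```

Note that 0x1F (Unit Separator) is also accepted by `str.isspace()`, and that `bytes.isspace()` rejects the separators, so even within Python the two definitions differ.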

As a further nuance, 0x0B (\v) Vertical Tab is considered whitespace by std::isspace, the POSIX locale, and has the White_Space Unicode character property, but is not considered ASCII whitespace by Rust's char::is_ascii_whitespace(). (As an aside, this means that Rust's char::is_whitespace() is not actually a superset of char::is_ascii_whitespace.)
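To make the differences concrete, the three ASCII definitions discussed so far can be compared as sets. This is a sketch in Python; the Rust set is transcribed from the documented behavior of `char::is_ascii_whitespace()` rather than computed, and all names are illustrative.

```python
# POSIX locale / std::isspace in the "C" locale.
POSIX = frozenset(" \t\n\v\f\r")

# Rust's char::is_ascii_whitespace(): same as POSIX minus the vertical tab.
RUST_ASCII = frozenset(" \t\n\f\r")

# Python's str.isspace(), restricted to ASCII.
PYTHON_ASCII = frozenset(chr(cp) for cp in range(128) if chr(cp).isspace())

posix_not_rust = POSIX - RUST_ASCII       # only the vertical tab
python_not_posix = PYTHON_ASCII - POSIX   # the separators 0x1C-0x1F
```

The three sets nest (Rust inside POSIX inside Python) but none of them coincide, which is the ambiguity behind the phrase "ASCII whitespace".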

Given all that, I think that the concept of "ASCII whitespace" is not well defined, but there are two things that are well-defined:

POSIX whitespace
Python whitespace

and my thoughts re. the best course of action are:

Fix our is_posix_space() function (as you do here) to not include separator characters
I'm unsure of the value in adding an is_ascii_space() with unclear semantics.
Either continue using is_posix_space(), or commit to using is_python_space() in string processing functions like atol().

What do you think about that? I've dug into some of the nuance here, but this is a complicated space and I'm sure there's pieces I've missed, so would love to hear your thoughts 🙂

martinvuyk · 2025-02-26T21:54:34Z

Yes when I implemented the isspace function it was !fun to have some very deep dives into other languages and some very obscure docs. No one seems to document these things properly (I had a tiny meltdown in #3843 when I saw the character set I wrote deleted).

My guess was that \x1c, \x1d, \x1e are legacy characters that aren't used nowadays. But maybe in the old days of the wild 7-bit char and 36-bit word sizes, many such characters were still in use. \r\n on Windows is still a legacy from the typewriter era...

And then you made me google it XD. Here is an excerpt from this Stack Overflow question:

28 – FS – File separator
The file separator FS is an interesting control code, as it gives us insight in the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.

29 – GS – Group separator
Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group.

30 – RS – Record separator
Within a group (or table) the records are separated with RS or record separator.

31 – US – Unit separator
The smallest data items to be stored in a database are called units in the ASCII definition. We would call them field now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited—as in the sixties—this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...

So yes, very much legacy. But they are separators in the ASCII standard, just not well documented anywhere.

As a further nuance, 0x0B (\v) Vertical Tab is considered whitespace by std::isspace, the POSIX locale, and has the White_Space Unicode character property, but is not considered ASCII whitespace by Rust's char::is_ascii_whitespace(). (As an aside, this means that Rust's char::is_whitespace() is not actually a superset of char::is_ascii_whitespace.)

While implementing our whitespace-related code I came across many such inconsistencies in other languages/libc implementations...

Either continue using is_posix_space(), or commit to using is_python_space() in string processing functions like atol().

IMO we should absolutely commit to being Python-compatible in things like this; otherwise code won't be able to be migrated. This is yet another argument for parametrizing String: if we know that string.encoding == Encoding.ASCII, we can ignore Unicode whitespace characters and check only the ASCII ones. Even though \r\n will still be a headache, it will mostly speed things up.

We could also make supporting "legacy_ascii" a parameter that defaults to False for all functions which use .isspace(), so that when the niche use case of reading digitized magnetic tape or punch cards does arise, they can be processed just the same.
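A rough sketch of what such an opt-in flag could look like, written in Python for illustration. The function name, the `legacy_ascii` keyword, and the choice to include 0x1F alongside 0x1C-0x1E (following Python's `str.isspace()`) are all assumptions, not an existing API.

```python
POSIX_WHITESPACE = frozenset(b" \t\n\v\f\r")

# Assumption: the "legacy" set is the four ASCII separators FS, GS, RS, US.
LEGACY_SEPARATORS = frozenset((0x1C, 0x1D, 0x1E, 0x1F))


def is_ascii_space(byte: int, legacy_ascii: bool = False) -> bool:
    """Return True for POSIX whitespace, plus the separators if opted in."""
    if byte in POSIX_WHITESPACE:
        return True
    return legacy_ascii and byte in LEGACY_SEPARATORS
```

With the default, `is_ascii_space(0x1C)` is False; passing `legacy_ascii=True` restores the Python-style behavior for callers that need it.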

Or we could just drop them.

WDYT?

fix wrong whitespace designation

c29f8b5

Signed-off-by: martinvuyk <[email protected]>

JoeLoser requested a review from ConnorGray February 11, 2025 16:15

JoeLoser assigned ConnorGray Feb 19, 2025

martinvuyk added 4 commits February 21, 2025 13:51

Merge remote-tracking branch 'upstream/main' into fix-posix-ascii-spa…

d764e72

…ce-func

fix after merge

f5c9ae5

Signed-off-by: martinvuyk <[email protected]>

fix after merge

3690ad4

Signed-off-by: martinvuyk <[email protected]>

fix after merge

6405bea

Signed-off-by: martinvuyk <[email protected]>

Merge branch 'main' into fix-posix-ascii-space-func

15d4d3c

modular-automation bot assigned JoeLoser Feb 24, 2025

modularbot added the imported-internally Signals that a given pull request has been imported internally. label Feb 24, 2025

martinvuyk added 2 commits February 28, 2025 07:45

Merge remote-tracking branch 'upstream/main' into fix-posix-ascii-spa…

4e4c61e

…ce-func

fix after merge

79c8fe5

Signed-off-by: martinvuyk <[email protected]>
