-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[stdlib] Fix wrong Char
posix whitespace designation
#3983
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: martinvuyk <[email protected]>
FYI @ConnorGray as you're actively working in this space |
@martinvuyk do you mind rebasing/resolving conflicts and then we can sync this? Thanks! |
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoeLoser done :)
@@ -327,7 +327,7 @@ fn _trim_and_handle_sign(str_slice: StringSlice, str_len: Int) -> (Int, Bool): | |||
""" | |||
var buff = str_slice.unsafe_ptr() | |||
var start: Int = 0 | |||
while start < str_len and Codepoint(buff[start]).is_posix_space(): | |||
while start < str_len and Codepoint(buff[start]).is_ascii_space(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this is much more inefficient than using the original _isspace()
(what this method's implementation uses) directly since it is not transforming to UTF-32 on each iteration (it just needs to compare a byte). Branch prediction might lessen the effect at runtime but some penalty is still there in the instruction fetching pipeline. It's one of the motivations behind #3988
!sync |
Thanks for tackling this Martin. Looking deeper into this, I have some thoughts. First, I notice our current Looking at definitions of whitespace from other systems:
However, I do notice that Python's
Which perhaps is why our current code also includes record separators as whitespace, to match Python's idea of what ASCII characters are whitespace. As a further nuance, Given all that, I think that the concept of "ASCII whitespace" is not well defined, but there are two things that are well-defined:
and my thoughts re. the best course of action are:
What do you think about that? I've dug into some of the nuance here, but this is a complicated space and I'm sure there's pieces I've missed, so would love to hear your thoughts 🙂 |
Yes when I implemented the isspace function it was !fun to have some very deep dives into other languages and some very obscure docs. No one seems to document these things properly (I had a tiny meltdown in #3843 when I saw the character set I wrote deleted). My guess was that And then you made me google it XD. Here is an excerpt from this stackoverflow question:
So yes, very much legacy. But they are separators in the ASCII standard, just not well documented anywhere.
While implementing our whitespace-related code I came across many such inconsistencies in other languages/libc implementations...
We should IMO absolutely commit to being python-compatible in things like this, otherwise code won't be able to be migrated. This is yet another argument for having String parametrized, if we know the We could also make supporting "legacy_ascii" a parameter that by default is Or we could just drop them. WDYT? |
Signed-off-by: martinvuyk <[email protected]>
Posix considers only
" \\t\\n\\v\\f\\r"
whereas the implementation takes ascii whitespace into account as well.