add fuzzer for libstdc++ internal unicode iterator #6
base: main
Conversation
libstdcpp/unicode-utf-iterator.cpp
Outdated
// this does not cause a runtime error, the iterator class protects against
// misuse
[[maybe_unused]] auto illegal = *v.end();
Similarly, incrementing the end iterator (and dereferencing it after that) should work without error, any number of times.
I added incrementing and dereferencing.
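A minimal sketch of such a check, assuming the view is named v as in the snippet above (illustrative only, not the exact code from the commit):

// past-the-end use must be safe: increment and dereference the end iterator
// repeatedly and make sure nothing crashes or triggers UB
auto it = v.end();
for (int i = 0; i < 4; ++i)
{
    ++it;                           // incrementing past the end, any number of times
    [[maybe_unused]] auto c = *it;  // dereferencing it afterwards must also be safe
}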
libstdcpp/unicode-utf-iterator.cpp
Outdated
}

// could anything be asserted by count vs. data.size()?
// assert(count <= data.size());
This will work when the output type is char32_t, but not otherwise.

For 32-bit input and char32_t output you should get count == data.size(), because either every input character is a valid UTF-32 code point that can be returned as char32_t, or it's invalid and will be returned as the U'\uFFFD' replacement character. Either way, it's 1 input to 1 output.

For 16-bit and 8-bit input and char32_t output, count <= data.size() should be true. Some 16- or 8-bit input values represent a single code point, and so get returned as a single char32_t output value. But some 16-bit input values are part of a surrogate pair, so two 16-bit code units are consumed to produce a single UTF-32 output. And some 8-bit input values are part of a multibyte sequence, in which case up to 4 bytes can be consumed to produce a single UTF-32 output. So in all cases count <= data.size() holds.
But when the output is 8-bit or 16-bit, you might get multiple output values for a single input. For example, the input range U"£" consists of a single char32_t code unit, but is encoded as two bytes in UTF-8; other code points require three or four bytes, so count >= data.size(). Similarly, any code point that doesn't fit in a single UTF-16 code unit will get encoded as a surrogate pair, so one 32-bit input value can produce up to two 16-bit output values. And 16-bit inputs with 8-bit outputs can also give an output count higher than the input size.
I think it's accurate to say that if sizeof(input) == sizeof(output) then count <= data.size() is true ... but I'd have to think about it further. Even in that case it's not the case that count == data.size(), because invalid UTF-8 sequences can be transformed to U+FFFD, which requires three bytes, so if the invalid input sequence consists of four bytes, the output will be smaller.
Given the complexity of figuring out what the valid inequalities are for different sized inputs and outputs, it might be best to not assert anything here, at least for now.
I put this comment into the fuzzer along with checks. The check from the second-to-last paragraph does not pass, and I do not know enough about Unicode to add value to this conversation, so I left it commented out.
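Roughly, the conditional checks could look like the sketch below. The names (check_count, From, To, data, count) are illustrative assumptions, not the variables actually used in this PR:

#include <cassert>
#include <cstddef>
#include <type_traits>

// Sketch only: encodes the inequalities from the comment above for char32_t
// output; for 8-bit or 16-bit output no simple inequality holds, so nothing
// is asserted there.
template<typename From, typename To, typename Range>
void check_count(const Range& data, std::size_t count)
{
  if constexpr (std::is_same_v<To, char32_t>)
    {
      if constexpr (sizeof(From) == sizeof(char32_t))
        assert(count == data.size()); // 1 input code unit -> 1 output code unit
      else
        assert(count <= data.size()); // surrogate pairs / multibyte sequences
    }
}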
Compare: 79f3430 to e594129
Compare: e594129 to 2e9c355
I added validation against the simdutf library, but I do not get it to pass. Question 1: is the output from the iterator guaranteed to be valid regardless of input? Question 2: is the UTF-16 returned in big, little, or native endian?
Yes, invalid input sequences should be transformed into U+FFFD.
UTF-16BE IIRC.
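A minimal sketch of such a validation step, assuming simdutf's validate_utf8/validate_utf16be/validate_utf32 entry points and that the fuzzer collects the iterator's output into a string named out (the names here are illustrative, not the PR's code):

#include <cassert>
#include <simdutf.h>

template<typename String>
void validate_output(const String& out)
{
  using Char = typename String::value_type;
  bool ok = true;
  if constexpr (sizeof(Char) == 1)
    ok = simdutf::validate_utf8(reinterpret_cast<const char*>(out.data()), out.size());
  else if constexpr (sizeof(Char) == 2)
    // per the answer above the output is UTF-16BE; if that turns out to be
    // wrong, the native-endian simdutf::validate_utf16 would be the one to try
    ok = simdutf::validate_utf16be(out.data(), out.size());
  else
    ok = simdutf::validate_utf32(out.data(), out.size());
  // invalid input should already have been replaced with U+FFFD by the iterator
  assert(ok);
}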
This is for an internal type recently added to libstdc++. No bugs were found.
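For context, the overall shape of such a fuzz target could look like the sketch below. The internal header <bits/unicode.h> and the std::__unicode::_Utf_view name are assumptions based on recent libstdc++ sources; they are unstable internal names and may differ from what this PR actually uses:

// Sketch only: the internal libstdc++ names below are assumptions, not stable API.
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <bits/unicode.h>

extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t* data, std::size_t size)
{
  using Input = std::basic_string_view<char8_t>;

  // decode arbitrary bytes as UTF-8 and iterate; this must never crash or
  // trigger undefined behaviour, whatever the input is
  std::__unicode::_Utf_view<char32_t, Input> v(
      Input{reinterpret_cast<const char8_t*>(data), size});
  std::size_t count = 0;
  for (char32_t c : v)
    {
      (void)c;
      ++count;
    }
  return 0;
}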