add fuzzer for libstd++ internal unicode iterator #6

Open
pauldreik wants to merge 5 commits into main from libstdcpp-unicode-iterator

Conversation

pauldreik
Owner

this is for an internal type recently added to libstdc++

no bugs were found.

@pauldreik changed the title from "add fuzzer for ibstd++ internal unicode iterator" to "add fuzzer for libstd++ internal unicode iterator" on Jan 14, 2024

// this does not cause a runtime error, the iterator class protects against
// misuse
[[maybe_unused]] auto illegal = *v.end();
Contributor

Similarly, incrementing the end iterator (and dereferencing it after that) should work without error, any number of times.

Owner Author

I added incrementing and dereferencing.
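
To illustrate the behaviour discussed in this thread (dereferencing and incrementing at the end without a runtime error), here is a minimal sketch of a bounds-aware iterator. `GuardedIter` is a hypothetical name, and the choice to return `U'\0'` at the end is an assumption made for the sketch; this is not libstdc++'s implementation and makes no claim about what its iterator returns there.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bounds-aware iterator over a char32_t buffer.
// It remembers the end of the range, so misuse at the end is harmless.
class GuardedIter {
public:
    constexpr GuardedIter(const char32_t* cur, const char32_t* last)
        : cur_(cur), last_(last) {}

    // Dereferencing at the end yields a placeholder (an assumption of this
    // sketch) instead of reading past the buffer.
    constexpr char32_t operator*() const {
        return cur_ == last_ ? U'\0' : *cur_;
    }

    // Incrementing at the end is a no-op, so it can be repeated any number
    // of times.
    constexpr GuardedIter& operator++() {
        if (cur_ != last_) ++cur_;
        return *this;
    }

    constexpr bool at_end() const { return cur_ == last_; }

private:
    const char32_t* cur_;
    const char32_t* last_;
};
```

A fuzzer can then increment and dereference past the end freely; with this design the only observable effect is that the placeholder value comes back.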

}

// could anything be asserted by count vs. data.size()?
// assert(count <= data.size());
Contributor

jwakely commented Jan 16, 2024

This will work when the output type is char32_t, but not otherwise.

For 32-bit input and char32_t output you should get count == data.size(), because either every input character is a valid UTF-32 code point that can be returned as char32_t, or it's invalid and will be returned as the U'\uFFFD' replacement character. Either way, it's 1 input to 1 output.

For 16-bit and 8-bit input and char32_t output then count <= data.size() should be true. Some 16- or 8-bit input values represent a single code point, and so get returned as a char32_t output value. But some 16-bit input values are part of a surrogate pair and so two 16-bit code units are consumed to produce a single UTF-32 output. And some 8-bit input values are part of a multibyte sequence, in which case up to 4 bytes can be consumed to produce a single UTF-32 output. In all cases, count <= data.size() holds.

But when the output is 8-bit or 16-bit, then you might get multiple output values for a single input. e.g. the input range U"£" consists of a single char32_t code unit, but is encoded as two bytes in UTF-8. Other code points require three or four bytes. So count >= data.size(). Similarly, any code point that doesn't fit in a single UTF-16 code unit will get encoded as a surrogate pair, so one 32-bit input value can produce up to two 16-bit output values. And 16-bit inputs with 8-bit outputs can also give an output count higher than the input size.

I think it's accurate to say that if sizeof(input) == sizeof(output) then count <= data.size() is true ... but I'd have to think about it further. Even then, count == data.size() does not hold in general, because invalid UTF-8 sequences can be transformed to U+FFFD which requires three bytes, so if the input consists of four bytes, the output will be smaller.

Given the complexity of figuring out what the valid inequalities are for different sized inputs and outputs, it might be best to not assert anything here, at least for now.
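
The inequalities in this comment come down to how many code units each encoding needs per code point. A minimal sketch, assuming the input is already a valid Unicode scalar value; `utf8_length` and `utf16_length` are hypothetical helper names, not part of libstdc++ or the fuzzer:

```cpp
#include <cassert>
#include <cstddef>

// Number of UTF-8 code units (bytes) needed to encode one scalar value.
// Assumes cp is a valid scalar value (not a surrogate, at most U+10FFFF).
constexpr std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of UTF-16 code units needed to encode one scalar value.
// Code points outside the Basic Multilingual Plane need a surrogate pair.
constexpr std::size_t utf16_length(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// One char32_t input unit can expand to several narrower output units.
static_assert(utf8_length(U'\u00A3') == 2, "pound sign is two UTF-8 bytes");
static_assert(utf8_length(U'\uFFFD') == 3, "U+FFFD is three UTF-8 bytes");
static_assert(utf16_length(U'\U0001F600') == 2, "needs a surrogate pair");
```

Read in the other direction, the same table explains the narrowing case: several 8-bit or 16-bit input units collapse into one char32_t output.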

Owner Author

I put this comment into the fuzzer along with checks. The check from the second-to-last paragraph does not pass, and I do not know enough about Unicode to add anything useful to this conversation, so I left it commented out.

pauldreik added a commit that referenced this pull request Jul 27, 2024
@pauldreik pauldreik force-pushed the libstdcpp-unicode-iterator branch from 79f3430 to e594129 Compare July 27, 2024 18:03
@pauldreik pauldreik force-pushed the libstdcpp-unicode-iterator branch from e594129 to 2e9c355 Compare July 27, 2024 18:14
@pauldreik
Owner Author

I added validation against the simdutf library; however, I cannot get it to pass.

question 1: Is the output from the iterator guaranteed to be valid regardless of input?

question 2: is the utf16 returned in big, little or native endian?

@jwakely
Contributor

jwakely commented Jul 27, 2024

> question 1: Is the output from the iterator guaranteed to be valid regardless of input?

Yes, invalid input sequences should be transformed into U+FFFD.

> question 2: is the utf16 returned in big, little or native endian?

Native endianness IIRC: the iterator yields char16_t code units as values, not a byte stream.
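
The answer to question 1 can be illustrated for the simplest case, UTF-32 input and UTF-32 output. This is a minimal sketch, not the libstdc++ code: `is_scalar_value` and `sanitize_utf32` are hypothetical helpers, and only the rule that invalid input becomes U+FFFD is taken from the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if cp is a Unicode scalar value: at most U+10FFFF and not a
// surrogate (U+D800..U+DFFF).
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: copy the input, replacing every invalid code unit
// with U+FFFD. The output has the same length as the input and contains
// only scalar values, whatever the input was.
std::u32string sanitize_utf32(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        out.push_back(is_scalar_value(cp) ? cp : U'\uFFFD');
    }
    return out;
}
```

A validation check of this kind holds for arbitrary fuzzer input because U+FFFD is itself a valid scalar value, so the replacement can never produce invalid output.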
