add fuzzer for libstdc++ internal unicode iterator #6
base: main
Conversation
libstdcpp/unicode-utf-iterator.cpp
Outdated
// this does not cause a runtime error, the iterator class protects against
// misuse
[[maybe_unused]] auto illegal = *v.end();
Similarly, incrementing the end iterator (and dereferencing it after that) should work without error, any number of times.
I added incrementing and dereferencing.
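A minimal sketch of such a check, assuming the view is named v as in the snippet above (illustrative only, not the exact code from the commit):

// past-the-end use must be safe: increment and dereference the end iterator
// repeatedly and make sure nothing crashes or triggers UB
auto it = v.end();
for (int i = 0; i < 4; ++i)
{
    ++it;                           // incrementing past the end, any number of times
    [[maybe_unused]] auto c = *it;  // dereferencing it afterwards must also be safe
}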
libstdcpp/unicode-utf-iterator.cpp
Outdated
}

// could anything be asserted by count vs. data.size()?
// assert(count <= data.size());
This will work when the output type is char32_t, but not otherwise.

For 32-bit input and char32_t output you should get count == data.size(), because either every input character is a valid UTF-32 code point that can be returned as char32_t, or it's invalid and will be returned as the U'\uFFFD' replacement character. Either way, it's 1 input to 1 output.

For 16-bit and 8-bit input and char32_t output, count <= data.size() should be true. Some 16- or 8-bit input values represent a single code point, and so get returned as a single char32_t output value. But some 16-bit input values are part of a surrogate pair, so two 16-bit code units are consumed to produce a single UTF-32 output. And some 8-bit input values are part of a multibyte sequence, in which case up to 4 bytes can be consumed to produce a single UTF-32 output. So in all cases count <= data.size() holds.
But when the output is 8-bit or 16-bit, you might get multiple output values for a single input. For example, the input range U"£" consists of a single char32_t code unit, but is encoded as two bytes in UTF-8; other code points require three or four bytes, so count >= data.size(). Similarly, any code point that doesn't fit in a single UTF-16 code unit will get encoded as a surrogate pair, so one 32-bit input value can produce up to two 16-bit output values. And 16-bit inputs with 8-bit outputs can also give an output count higher than the input size.
I think it's accurate to say that if sizeof(input) == sizeof(output) then count <= data.size() is true ... but I'd have to think about it further. Even in that case it's not the case that count == data.size(), because invalid UTF-8 sequences can be transformed to U+FFFD, which requires three bytes, so if the invalid input sequence consists of four bytes, the output will be smaller.
Given the complexity of figuring out what the valid inequalities are for different sized inputs and outputs, it might be best to not assert anything here, at least for now.
I put this comment into the fuzzer along with checks. The check from the second-to-last paragraph does not pass, and I do not know enough about Unicode to add value to this conversation, so I left it commented out.
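Roughly, the conditional checks could look like the sketch below. The names (check_count, From, To, data, count) are illustrative assumptions, not the variables actually used in this PR:

#include <cassert>
#include <cstddef>
#include <type_traits>

// Sketch only: encodes the inequalities from the comment above for char32_t
// output; for 8-bit or 16-bit output no simple inequality holds, so nothing
// is asserted there.
template<typename From, typename To, typename Range>
void check_count(const Range& data, std::size_t count)
{
  if constexpr (std::is_same_v<To, char32_t>)
    {
      if constexpr (sizeof(From) == sizeof(char32_t))
        assert(count == data.size()); // 1 input code unit -> 1 output code unit
      else
        assert(count <= data.size()); // surrogate pairs / multibyte sequences
    }
}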
Compare: 79f3430 to e594129
Compare: e594129 to 2e9c355
I added validation against the simdutf library, but I do not get it to pass. Question 1: is the output from the iterator guaranteed to be valid regardless of input? Question 2: is the UTF-16 returned in big, little, or native endian?
Yes, invalid input sequences should be transformed into U+FFFD.
UTF-16BE IIRC.
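A minimal sketch of such a validation step, assuming simdutf's validate_utf8/validate_utf16be/validate_utf32 entry points and that the fuzzer collects the iterator's output into a string named out (the names here are illustrative, not the PR's code):

#include <cassert>
#include <simdutf.h>

template<typename String>
void validate_output(const String& out)
{
  using Char = typename String::value_type;
  bool ok = true;
  if constexpr (sizeof(Char) == 1)
    ok = simdutf::validate_utf8(reinterpret_cast<const char*>(out.data()), out.size());
  else if constexpr (sizeof(Char) == 2)
    // per the answer above the output is UTF-16BE; if that turns out to be
    // wrong, the native-endian simdutf::validate_utf16 would be the one to try
    ok = simdutf::validate_utf16be(out.data(), out.size());
  else
    ok = simdutf::validate_utf32(out.data(), out.size());
  // invalid input should already have been replaced with U+FFFD by the iterator
  assert(ok);
}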
This is for an internal type recently added to libstdc++. No bugs were found.
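For context, the overall shape of such a fuzz target could look like the sketch below. The internal header <bits/unicode.h> and the std::__unicode::_Utf_view name are assumptions based on recent libstdc++ sources; they are unstable internal names and may differ from what this PR actually uses:

// Sketch only: the internal libstdc++ names below are assumptions, not stable API.
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <bits/unicode.h>

extern "C" int LLVMFuzzerTestOneInput(const std::uint8_t* data, std::size_t size)
{
  using Input = std::basic_string_view<char8_t>;

  // decode arbitrary bytes as UTF-8 and iterate; this must never crash or
  // trigger undefined behaviour, whatever the input is
  std::__unicode::_Utf_view<char32_t, Input> v(
      Input{reinterpret_cast<const char8_t*>(data), size});
  std::size_t count = 0;
  for (char32_t c : v)
    {
      (void)c;
      ++count;
    }
  return 0;
}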