Support streaming code unit sequences by saving incomplete code unit sequences as encoding state #15

Open
tahonermann opened this issue Feb 16, 2016 · 4 comments

Comments

@tahonermann
Owner

tahonermann commented Feb 16, 2016

Consuming code unit sequences from a streaming source may result in attempts to decode a partial code unit sequence. At present, an exception is thrown when such underflow occurs. An alternative would be to store the partial code unit sequence in the iterator state and then have the iterator compare equal to the end iterator. This would enable code like the following to work correctly even if the end of a buffer does not fall on a code unit sequence boundary.

using encoding = utf8_encoding;
auto state = encoding::initial_state();
std::string b;  // Declared outside the loop so the loop condition can see it.
do {
   b = get_more_data();
   auto tv = make_text_view<utf8_encoding>(state, begin(b), end(b));
   auto tv_it = begin(tv);
   while (tv_it != end(tv))
     ...;
   state = tv_it;  // Trailing state is in tv_it, preserve it
                   // to seed state for the next iteration.
} while(!b.empty());

A problem with this approach is that it leaves open the possibility for trailing code units (e.g., garbage at the end of the encoded text) to go unnoticed. Because of this, the behavior above probably shouldn't be the default behavior, but it should be possible for code to opt in to it; perhaps via a policy class as suggested in #14.
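To make the idea concrete without depending on text_view itself, here is a minimal standalone sketch of a UTF-8 decoder that saves an incomplete trailing sequence and resumes with the next buffer. The class `utf8_stream_decoder` and its members `feed`, `has_pending`, and `finish` are hypothetical names invented for this illustration; they are not part of text_view. The decoding is deliberately simplified and does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper (not a text_view API): carries an incomplete UTF-8
// code unit sequence from one buffer to the next.
class utf8_stream_decoder {
public:
    // Decode every complete sequence in `chunk`, appending code points to
    // `out`. An incomplete trailing sequence is saved for the next call.
    void feed(std::string_view chunk, std::u32string& out) {
        std::string data = pending_;
        data.append(chunk.data(), chunk.size());
        pending_.clear();
        std::size_t i = 0;
        while (i < data.size()) {
            std::size_t len =
                sequence_length(static_cast<unsigned char>(data[i]));
            if (i + len > data.size()) {
                // Underflow: save the partial sequence instead of throwing.
                pending_ = data.substr(i);
                break;
            }
            out.push_back(decode_one(data, i, len));
            i += len;
        }
        assert(pending_.size() < 4);  // a partial sequence is 1 to 3 bytes
    }

    // True if the last buffer ended in the middle of a sequence.
    bool has_pending() const { return !pending_.empty(); }

    // Strictness check for end of input: trailing code units are an error.
    void finish() const {
        if (has_pending())
            throw std::runtime_error("truncated code unit sequence");
    }

private:
    // Sequence length implied by the lead byte (simplified validation).
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        throw std::runtime_error("invalid lead code unit");
    }

    // Decode one complete sequence of `len` bytes starting at `pos`.
    static char32_t decode_one(const std::string& s, std::size_t pos,
                               std::size_t len) {
        unsigned char lead = static_cast<unsigned char>(s[pos]);
        if (len == 1) return lead;
        char32_t cp = lead & (0xFF >> (len + 1));  // payload bits of lead
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[pos + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid continuation code unit");
            cp = (cp << 6) | (c & 0x3F);
        }
        return cp;
    }

    std::string pending_;  // saved partial sequence, if any
};
```

In this sketch the caller decides how strict to be: calling `finish()` once the source is exhausted reports trailing code units, while skipping that call accepts them silently, which is the risk described above.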

@ruoso

ruoso commented Sep 23, 2016

I have been thinking about this topic and wrote these two documents:
https://github.com/ruoso/u5e/blob/master/StreamVsIterators.md
https://github.com/ruoso/u5e/blob/master/StreamVsFormat.md

I believe it's best if there is a clearer "firewall" between raw data and text. The code handling the specific streamed protocol (such as HTTP or IRC) is in a much better position to validate the data before 'declaring' it to be text. Doing that in the iterator itself creates an undue burden on everyone handling that type of code.
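One possible reading of this "firewall" is sketched below: a protocol layer accumulates raw bytes and frames complete messages, and only a complete, validated message can become a `text` value. Both `line_framer` and `text` are hypothetical types made up for this illustration (they come from neither u5e nor text_view), newline-delimited framing is assumed purely as an example, and the UTF-8 check is structural only (it does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical type: bytes that have passed a UTF-8 check.
class text {
public:
    // The only way to obtain a `text` is to validate a complete message.
    static std::optional<text> from_bytes(std::string bytes) {
        if (!is_valid_utf8(bytes)) return std::nullopt;
        return text(std::move(bytes));
    }
    const std::string& utf8() const { return bytes_; }

private:
    explicit text(std::string b) : bytes_(std::move(b)) {}

    // Structural check: lead bytes and continuation bytes only.
    static bool is_valid_utf8(const std::string& s) {
        std::size_t i = 0;
        while (i < s.size()) {
            unsigned char lead = static_cast<unsigned char>(s[i]);
            std::size_t len = lead < 0x80           ? 1
                              : (lead & 0xE0) == 0xC0 ? 2
                              : (lead & 0xF0) == 0xE0 ? 3
                              : (lead & 0xF8) == 0xF0 ? 4
                                                      : 0;
            if (len == 0 || i + len > s.size()) return false;
            for (std::size_t k = 1; k < len; ++k) {
                unsigned char c = static_cast<unsigned char>(s[i + k]);
                if ((c & 0xC0) != 0x80) return false;
            }
            i += len;
        }
        return true;
    }

    std::string bytes_;
};

// Hypothetical protocol layer: buffers raw bytes and frames messages on
// '\n'. It knows nothing about text, only about message boundaries.
class line_framer {
public:
    void feed(const std::string& bytes) { buffer_ += bytes; }

    // Returns the next complete message, or nullopt if none is buffered.
    std::optional<std::string> next_message() {
        std::size_t pos = buffer_.find('\n');
        if (pos == std::string::npos) return std::nullopt;
        std::string msg = buffer_.substr(0, pos);
        buffer_.erase(0, pos + 1);
        return msg;
    }

private:
    std::string buffer_;
};
```

With this split, a read that ends in the middle of a code unit sequence is handled by the framer's byte buffer, so the text layer never sees a partial sequence.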

@tahonermann
Owner Author

I agree that ensuring proper data boundaries in packet-oriented protocols is best practice. I think there will always be cases where that isn't possible though. In those cases, the only solutions I've found so far are for the iterator to throw an exception, to block (on advancement of the underlying code unit iterator), or to take the approach described in the first comment of this issue.

The initial email thread where I requested feedback on text_view talked about some of these options. You can find it at: https://groups.google.com/a/isocpp.org/d/msg/std-proposals/Tu84_TQOlhc/lV0MdIq1HQAJ

@ruoso

ruoso commented Sep 25, 2016

My point is that introducing that support is counterproductive. It is a
use case that only makes sense from a theoretical standpoint.

In practice, the industry consensus is that the only reasonable way to
handle the distinction between what is semantically considered "text" and
what is a "sequence of bytes" is by creating a strict type-safe firewall
between code that handles text and code that handles bytes.

Any library support that weakens that firewall not only is not useful
(since the network layer does need to be byte-by-byte precise), but it is
actually harmful (because it leads developers into thinking it's possible
to send "text" over a socket, when reality is way more complicated than
that).

On Sat, Sep 24, 2016 at 22:07, Tom Honermann wrote:

I agree that ensuring proper data boundaries in packet oriented protocols
is best practice. I think there will always be cases where that isn't
possible though. In those cases, the only solutions I've found so far are
for the iterator to throw an exception, block (on advancement of the
underlying code unit iterator), or the approach described in the first
comment of this issue.

The initial email thread where I requested feedback on text_view talked
about some of these options. You can find it at:
https://groups.google.com/a/isocpp.org/d/msg/std-proposals/Tu84_TQOlhc/lV0MdIq1HQAJ



@tahonermann
Owner Author

I think there are legitimate use cases. People stream text across command line pipes all the time. Granted, blocking and data loss tend not to be issues in those cases.

At any rate, addressing this issue is not high on my priority list. This issue was opened due to concerns raised in the email thread mentioned in #15 (comment)
