Please clarify String byte-encoding format #18

timmc · 2014-08-14T01:33:21Z

Are strings encoded in UTF-8, UTF-16, UCS-2, or what? For example, how would the String value "𐀀" (U+10000) be encoded in the various formats? (That is, what character encoding is used for Transit when encoding JSON to bytes?)
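To make the question concrete, here is a small sketch of how the single code point U+10000 comes out under a few Unicode encoding forms. It uses only Python's built-in codecs and says nothing about what Transit itself does; the variable names are illustrative. UCS-2 is left out because it cannot represent code points above U+FFFF at all.

```python
# The string under discussion: one code point, U+10000.
s = "\U00010000"

# Big-endian variants are used so that no byte-order mark is prepended.
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-be")
utf32 = s.encode("utf-32-be")

print(utf8.hex(" "))   # f0 90 80 80  (one four-byte sequence)
print(utf16.hex(" "))  # d8 00 dc 00  (surrogate pair D800 DC00)
print(utf32.hex(" "))  # 00 01 00 00  (the code point itself)
```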

(Edit: Removed further question about illegal bytes in raw-string inputs in various languages.)

(Sorry for the thrashing -- I'm reopening this now that I see that yes, Github Issues are in fact being used for this project.)


jlouis · 2014-08-14T13:29:14Z

The question is actually: "Is there a default encoding of the document, or is this negotiable?" In practice, is it pinned to UTF-8 (in which case U+10000 and everything else outside the BMP takes four bytes; surrogate pairs are only needed in UTF-16 and in JSON \u escapes), or is it something else?

timmc · 2014-08-14T14:27:35Z

There are actually two questions, one of which I think is already answered:

1. When a String value's characters are expressed in JSON, what encoding is used? (Answer from spec: No encoding needed for non-ASCII, but UTF-16 is to be used for any Unicode character escapes.)
2. When the JSON is then written to a byte-oriented medium, what encoding is used for the character->bytes conversion?
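The two layers can be seen separately in a short sketch. It uses Python's standard `json` module purely as an illustration, not Transit's own writer, and the choice of UTF-8 for the second step is an assumption, since that is exactly the open question.

```python
import json

value = "\U00010000"

# Layer 1: characters -> JSON text.
# With ensure_ascii=True (the default) non-ASCII characters become \u escapes,
# and a code point outside the BMP becomes a UTF-16 surrogate pair.
escaped = json.dumps(value)
# With ensure_ascii=False the character is written into the JSON text as-is.
literal = json.dumps(value, ensure_ascii=False)

# Layer 2: JSON text -> bytes. UTF-8 is assumed here.
escaped_bytes = escaped.encode("utf-8")
literal_bytes = literal.encode("utf-8")

print(escaped)        # "\ud800\udc00"
print(literal_bytes)  # b'"\xf0\x90\x80\x80"'

# Both byte forms decode back to the same one-character string.
assert json.loads(escaped_bytes.decode("utf-8")) == value
assert json.loads(literal_bytes.decode("utf-8")) == value
```

The escaped form is pure ASCII, so it survives most encoding mix-ups at the byte layer; the literal form depends on writer and reader agreeing on the byte encoding.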

The JSON spec suggests using UTF-8, but it doesn't demand it. I think it would be appropriate for Transit to lock this down so that we don't get nasty character encoding issues between platforms with different system defaults (e.g. Windows-1252 in Windows with English locales.)
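As a rough illustration of the cross-platform hazard, the sketch below writes JSON as UTF-8 and then reads it back with Windows-1252, as might happen when a reader falls back to a system default. The sample string and variable names are made up for the example; nothing here reflects Transit's actual behavior.

```python
import json

# Writer side: JSON text with a literal non-ASCII character, encoded as UTF-8.
text = json.dumps("café", ensure_ascii=False)
wire = text.encode("utf-8")          # b'"caf\xc3\xa9"'

# Reader side, wrong guess: the platform default is assumed to be Windows-1252.
wrong = json.loads(wire.decode("cp1252"))
# Reader side, correct: the encoding is agreed on in advance.
right = json.loads(wire.decode("utf-8"))

print(wrong)  # cafÃ©
print(right)  # café
```

No error is raised in the wrong case, which is what makes this class of bug easy to miss until the data has already been stored or forwarded.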

(As for MessagePack, a quick glance suggests it already specifies an encoding of UTF-8.)

timmc closed this as completed Aug 14, 2014

timmc reopened this Aug 14, 2014

stuartsierra mentioned this issue Jul 19, 2016

Clarify that JSON Encoding uses UTF-8 cognitect/transit-java#14

Open

marci4 mentioned this issue Oct 3, 2017

String decoding problem TooTallNate/Java-WebSocket#566

Closed
