Please clarify String byte-encoding format #18

Open
timmc opened this issue Aug 14, 2014 · 2 comments

Comments

timmc commented Aug 14, 2014

Are strings encoded in UTF-8, UTF-16, UCS-2, or what? For example, how would the String value "𐀀" (U+10000) be encoded in the various formats? (That is, what character encoding is used for Transit when encoding JSON to bytes?)
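For reference, here is a small Python sketch of how U+10000 comes out under a few common encodings. It only illustrates the encodings themselves and says nothing about which one Transit actually mandates; the variable names are just for this example.

```python
# U+10000 is the first code point outside the Basic Multilingual Plane (BMP).
ch = "\U00010000"

utf8 = ch.encode("utf-8")       # four-byte sequence
utf16 = ch.encode("utf-16-be")  # surrogate pair, big-endian, no BOM
utf32 = ch.encode("utf-32-be")  # single 32-bit unit, big-endian, no BOM

print(utf8.hex())   # f0908080
print(utf16.hex())  # d800dc00
print(utf32.hex())  # 00010000
```

UCS-2 has no surrogate mechanism, so it cannot represent this character at all.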

(Edit: Removed further question about illegal bytes in raw-string inputs in various languages.)

(Sorry for the thrashing -- I'm reopening this now that I see that yes, GitHub Issues are in fact being used for this project.)

timmc closed this as completed Aug 14, 2014
timmc reopened this Aug 14, 2014

jlouis commented Aug 14, 2014

The question is actually: "is there a default encoding of the document or is this negotiable?" In practice, is it nailed down to UTF-8 (in which case U+10000 and everything else outside the BMP is encoded as a four-byte sequence, whereas UTF-16 would need surrogate pairs), or is it something else?

timmc commented Aug 14, 2014

There are actually two questions, one of which I think is already answered:

  1. When a String value's characters are expressed in JSON, what encoding is used? (Answer from spec: non-ASCII characters need no escaping, but any Unicode character escapes must use UTF-16 code units.)
  2. When the JSON is then written to a byte-oriented medium, what encoding is used for the character->bytes conversion?
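To make the first question concrete, here is a sketch using Python's standard `json` module (not a Transit library) showing the two ways the same character can appear in JSON text: escaped as a UTF-16 surrogate pair, or written literally.

```python
import json

ch = "\U00010000"  # U+10000, outside the BMP

# Default ensure_ascii=True: non-ASCII is written as \uXXXX escapes,
# so a non-BMP character becomes a UTF-16 surrogate pair.
escaped = json.dumps(ch)

# ensure_ascii=False: the character is written literally, unescaped.
literal = json.dumps(ch, ensure_ascii=False)

print(escaped)  # "\ud800\udc00"

# Both forms parse back to the same one-character string.
assert json.loads(escaped) == json.loads(literal) == ch
```

Either form is still a sequence of characters at this point; the second question is about what happens when those characters are turned into bytes.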

The JSON spec suggests using UTF-8, but it doesn't demand it. I think it would be appropriate for Transit to lock this down so that we don't get nasty character encoding issues between platforms with different system defaults (e.g. Windows-1252 on Windows with English locales).
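A minimal sketch of the kind of mismatch I mean, again with plain Python `json` rather than any Transit implementation. The writer uses UTF-8 and the reader wrongly assumes Windows-1252; the payload is made up for the example.

```python
import json

# JSON text containing a literal (unescaped) non-ASCII character.
payload = json.dumps(["caf\u00e9"], ensure_ascii=False)

# Writer explicitly encodes as UTF-8 instead of relying on a system default.
wire = payload.encode("utf-8")

decoded_ok = wire.decode("utf-8")       # reader agrees on UTF-8
decoded_wrong = wire.decode("cp1252")   # reader assumes Windows-1252

print(json.loads(decoded_ok))     # ['café']
print(json.loads(decoded_wrong))  # ['cafÃ©']
```

The second read does not fail; it silently yields different data, which is why pinning the byte encoding in the spec seems worthwhile.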

(As for MessagePack, a quick glance suggests it already specifies an encoding of UTF-8.)
