Update VOTable to handle UTF-8 #55
Comments
(1) is deliberately awkward. Making the restriction glaringly obvious in the DALI document prevents us from endorsing |
On Fri, Dec 15, 2023 at 08:03:40AM -0800, Zarquan wrote:
> Using the current VOTable standard, some of the UTF-8 characters
> may end up being truncated to fit into the UCS-2 character set.
> Which is not the expected behaviour.
First off, I'd truly like to get rid of VOTable's UCS-2 legacy (it's
been obsolete for ages), too. But given that unicodeChar is a bit of
an oddity, I don't think we want to do a non-compatible
(major-version-pushing) VOTable change just because of this.
But then we shouldn't be using VOTable Unicode encodings for JSON
anyway.
> To resolve this:
> 1. Any changes to the DALI documents that propose `xtype="json"`
> MUST include a caveat in the text that explicitly restricts the
> JSON content to the UCS-2 character set.
No, we should say that people use char and avoid unicodeChar for JSON
(I'd probably even forbid unicodeChar). JSON is designed so it *can*
work with pure ASCII, and we should make that a *must* in order to
not paint us into the ugly UCS-2 corner of unicodeChar. RFC 8259
says:
> Any character may be escaped. If the character is in the Basic
> Multilingual Plane (U+0000 through U+FFFF), then it may be
> represented as a six-character sequence: a reverse solidus, followed
> by the lowercase letter u, followed by four hexadecimal digits that
> encode the character's code point. The hexadecimal letters A through
> F can be uppercase or lowercase. So, for example, a string
> containing only a single reverse solidus character may be represented
> as "\u005C".
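As a minimal sketch of that escaping rule, the following uses Python's standard `json` module (chosen here for illustration only; the thread does not prescribe a language). With `ensure_ascii=True`, every non-ASCII character is written as a `\uXXXX` escape, so the serialized text would fit in a plain `char` column. Characters outside the Basic Multilingual Plane come out as two consecutive escapes forming a UTF-16 surrogate pair. The helper name `to_ascii_json` is hypothetical.

```python
import json

def to_ascii_json(value):
    """Serialize value as JSON text containing only ASCII characters."""
    # ensure_ascii=True (the default) escapes every non-ASCII character as \uXXXX.
    return json.dumps(value, ensure_ascii=True)

# U+1F52D (telescope) lies outside the Basic Multilingual Plane.
doc = {"name": "Ångström", "symbol": "\U0001F52D"}
text = to_ascii_json(doc)

print(text)
print(text.isascii())           # the serialized form is pure ASCII
print(json.loads(text) == doc)  # and it round-trips without loss
```

A conforming JSON parser restores the original characters when it reads the escaped text, so no information is lost by restricting the column content to ASCII.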
> 2. We work to develop a new version of the VOTable standard which
> includes support for the full UTF-8 character set.
I'll not stop you, but note that in the entire metadata part you
*can* already use whatever unicode you want, it's just in unicodeChar
FIELD data that you're not allowed to (and that you can't in
BINARY(2)).
If you ask me: We should just allow UTF-8 in char[*] BINARY2 fields,
use native encoding in char[*] TABLEDATA and deprecate unicodeChar
(and BINARY, but that's tangential).
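The sketch below illustrates what UTF-8 in a `char[*]` BINARY2 cell could look like. It assumes the variable-length array layout of a 4-byte big-endian item count followed by the data, and it assumes the count is in bytes rather than characters, which the proposal above does not settle. The per-row null-flag bytes of BINARY2 are ignored, and the function names are hypothetical.

```python
import struct

def encode_char_cell(text):
    """Encode text as a variable-length char[*] cell: a 4-byte big-endian
    item count followed by the UTF-8 bytes (assumption: items are bytes)."""
    data = text.encode("utf-8")
    return struct.pack(">I", len(data)) + data

def decode_char_cell(buf):
    """Inverse of encode_char_cell for a buffer holding exactly one cell."""
    (count,) = struct.unpack(">I", buf[:4])
    return buf[4:4 + count].decode("utf-8")

# "Å" takes 2 bytes in UTF-8 and U+1F52D takes 4, so the count is 6.
cell = encode_char_cell("Å\U0001F52D")
print(cell[:4].hex(), decode_char_cell(cell))
```

Under these assumptions the byte count and the character count differ for any non-ASCII text, which is one reason a fixed `arraysize` would need a clear definition if this route were taken.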
As part of the group looking at updating our standards to be compatible with 2020 technologies, I propose that we update the VOTable standard to handle the full UTF-8 character set.
Issue DALI#33 is looking at adding support for `xtype="json"`. If we do adopt this new xtype, it allows a client to create a VOTable column with `datatype="unicodeChar", arraysize="*", xtype="json"`.
This implies that the client can populate this column with ANY valid JSON document and upload it to a TAP service, including JSON content that contains UTF-8 characters.
Using the current VOTable standard, some of the UTF-8 characters may end up being truncated to fit into the UCS-2 character set, which is not the expected behaviour.
To resolve this:
1. Any changes to the DALI documents that propose `xtype="json"` MUST include a caveat in the text that explicitly restricts the JSON content to the UCS-2 character set.
2. We work to develop a new version of the VOTable standard which includes support for the full UTF-8 character set.
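To make the failure mode concrete, here is a small sketch of how a client might check whether a JSON string would survive storage in a `unicodeChar` column. It assumes UCS-2 is read strictly as one 16-bit unit per character with no surrogate pairs. The helper names `fits_ucs2` and `offending_characters` are hypothetical and not part of any VOTable library.

```python
def fits_ucs2(text):
    """Return True if every character is in the Basic Multilingual Plane
    (U+0000..U+FFFF) and is not a surrogate code point."""
    return all(
        ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
        for ch in text
    )

def offending_characters(text):
    """List the characters that a strict UCS-2 column cannot hold."""
    return [ch for ch in text if not fits_ucs2(ch)]

bmp_only = '{"unit": "µJy"}'                 # µ is U+00B5, inside the BMP
with_supplementary = '{"note": "\U0001F52D"}'  # U+1F52D is outside the BMP

print(fits_ucs2(bmp_only))
print(fits_ucs2(with_supplementary))
print(offending_characters(with_supplementary))
```

A check of this kind only detects the problem; what a writer should do with the offending characters (reject, escape, or replace them) is exactly what the caveat in item 1 would need to spell out.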