Clarify where the 8-bit/Kanji split should be expected #9
Comments
Not exactly; the classification between "8-bit characters" and "Kanji interpretation" is based on the \kcatcode setting. (That means some ranges higher than U+0250 are also handled as "8-bit characters".) If \kcatcode is 15, the character is handled as "8-bit characters"; otherwise it is handled with "Kanji interpretation". \Uchar is not a upTeX primitive but an e-pTeX/e-upTeX primitive, so it is described only in eptexdoc.pdf, which is in Japanese. In e-pTeX and e-upTeX, \Uchar always generates a single character token with "Kanji interpretation" for character codes beyond 0xFF.
We hope that a single character in "Kanji interpretation" is simply passed through as-is (without case folding).
Please note that we on the Japanese team have known for years that the text handling of expl3 does not work properly for "Kanji interpretation" tokens at all (sometimes weird errors, sometimes broken text, etc.), but the fact that "explaining the complicated situation of Japanese TeX engines to you Western developers is very difficult" has prevented us from reporting it. Sorry for the inconvenience...
OK, I think I can handle that. I guess for codepoints > U+00FF I need to check the
I knew it was a bit of a hack, as at the time of originally writing the code the aim was to avoid differences from pdfTeX. I've now got better Unicode coverage there, so this is a good time to revisit the assumptions. I think I see what's needed: I will write myself some tests.
Turns out I can handle this without needing to test
Testing the input parsing of upTeX, assuming \enablecjktoken is set, I find that codepoints U+0000 to U+0249 are handled as 8-bit characters (so I get either one or two bytes), and from U+0250 I get Kanji interpretation (a single character read). With \disablecjktoken, everything is handled as 8-bit as far as I tested.

This seems to be somewhat at odds with the description of e.g. \Uchar in the English guide, and I am not sure exactly where the behaviour is documented. I assume it's deliberate, as codepoint U+0249 is not particularly notable other than falling at a convenient place to split in hexadecimal.

(This links to latex3/latex3#1171, which arises basically because, for testing against pdfTeX, we currently set \disablecjktoken and assume 8-bit behaviour across the board for upTeX. Before making any changes, I'd like to be clear about what the expectation actually is in terms of character ranges.)
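
For anyone following along, here is a rough Python model of the rule as described in this thread, not engine code. It assumes UTF-8 input, and that a character whose \kcatcode is 15 is read as its individual UTF-8 bytes while any other \kcatcode value gives a single "Kanji interpretation" character. The function name `tokens_read` is made up for illustration, and the caller supplies the \kcatcode value because the real per-range defaults are not listed here.

```python
def tokens_read(codepoint: int, kcatcode: int) -> int:
    """Return how many character tokens one input character yields
    under the simplified model described above (an assumption, not
    a statement about the actual upTeX implementation)."""
    if kcatcode == 15:
        # "8-bit characters": one token per UTF-8 byte.
        return len(chr(codepoint).encode("utf-8"))
    # "Kanji interpretation": the character is read as a single token.
    return 1


if __name__ == "__main__":
    # ASCII, a Latin Extended-B letter, and a hiragana letter,
    # each under both classifications.
    for cp in (0x41, 0x0249, 0x3042):
        print(f"U+{cp:04X}", tokens_read(cp, 15), tokens_read(cp, 16))
```

Under this model, the "one or two bytes" observation for U+0000 to U+0249 falls out of the UTF-8 encoding, and a character above U+0250 whose \kcatcode is 15 would show up as several bytes rather than one character.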