Clarify where the 8-bit/Kanji split should be expected #9

Open
josephwright opened this issue Feb 9, 2023 · 5 comments
@josephwright

Testing the input parsing of upTeX, assuming \enablecjktoken is set, I find that codepoints U+0000 to U+0249 are handled as 8-bit characters (so I get either one or two bytes), and from U+0250 I get Kanji interpretation (a single character read). With \disablecjktoken, everything is handled as 8-bit as far as I tested.

This seems to be somewhat at odds with the description of e.g. \Uchar in the English guide, and I am not sure exactly where the behaviour is documented. I assume it's deliberate as codepoint U+0249 is not particularly notable other than falling at a convenient place to split in hexadecimal.

(This links to latex3/latex3#1171, which arises because, for testing against pdfTeX, we currently set \disablecjktoken and assume 8-bit behaviour across the board for upTeX. Before any changes, I'd like to be clear what actually is the expectation in terms of character ranges.)

@aminophen
Member

aminophen commented Feb 9, 2023

> I find that codepoints U+0000 to U+0249 are handled as 8-bit characters (so I get either one or two bytes), and from U+0250 I get Kanji interpretation (a single character read).

Not exactly; the classification between "8-bit characters" and "Kanji interpretation" is based on the \kcatcode setting (which means that some ranges above U+0250 are handled as "8-bit characters"). If \kcatcode is 15, the character is handled as "8-bit characters"; otherwise it gets "Kanji interpretation".
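As a rough illustration of the rule just described (this is not upTeX code and is not taken from its sources), the following Python sketch models how one input character might be tokenised. The names `read_char` and `toy_kcatcode` are hypothetical, and the toy table is made up for the example; it does not reproduce upTeX's real default \kcatcode assignments.

```python
NOT_CJK = 15  # the \kcatcode value that, per this thread, selects 8-bit handling


def toy_kcatcode(codepoint):
    """Made-up stand-in for a \\kcatcode lookup (NOT upTeX's real defaults)."""
    # Assumption for illustration only: everything below U+0250 is 15,
    # everything else gets some value other than 15.
    return NOT_CJK if codepoint < 0x250 else 18


def read_char(codepoint, kcatcode_of=toy_kcatcode, cjk_tokens_enabled=True):
    """Model the tokens produced when one character is read from input.

    Returns a list of ("byte", value) items for 8-bit handling, or a
    single ("kanji", codepoint) item for Kanji interpretation.
    """
    utf8 = chr(codepoint).encode("utf-8")
    # With CJK tokens disabled, or with \kcatcode 15, the character is
    # seen as its individual UTF-8 bytes.
    if not cjk_tokens_enabled or kcatcode_of(codepoint) == NOT_CJK:
        return [("byte", b) for b in utf8]
    # Otherwise the whole character is read as one token.
    return [("kanji", codepoint)]
```

With the toy table, `read_char(0x00E9)` yields two byte items, while `read_char(0x3042)` yields one Kanji item; passing `cjk_tokens_enabled=False` models the \disablecjktoken case reported in the first comment, where everything tested came out as 8-bit.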

\Uchar is not a upTeX primitive but an e-pTeX/e-upTeX primitive, so it is described only in eptexdoc.pdf, which is in Japanese. In e-pTeX and e-upTeX, \Uchar always generates a single character token of "Kanji interpretation" for a charcode beyond 0xFF.

> Before any changes, I'd like to be clear what actually is the expectation in terms of character ranges.

We hope that a single character in "Kanji interpretation" is simply passed through as-is (without case folding).

@aminophen
Member

aminophen commented Feb 9, 2023

Please note that we on the Japanese team have known for years (!) that the text handling of expl3 does not work properly for "Kanji interpretation" tokens at all (sometimes weird errors, sometimes broken text, etc.), but the fact that "explaining the complicated situation of Japanese TeX engines to you Western developers is very difficult" has kept us from reporting it. Sorry for the inconvenience...

@josephwright
Author

> I find that codepoints U+0000 to U+0249 are handled as 8-bit characters (so I get either one or two bytes), and from U+0250 I get Kanji interpretation (a single character read).

> Not exactly; the classification between "8-bit characters" and "Kanji interpretation" is based on the \kcatcode setting (which means that some ranges above U+0250 are handled as "8-bit characters"). If \kcatcode is 15, the character is handled as "8-bit characters"; otherwise it gets "Kanji interpretation".

OK, I think I can handle that. I guess for codepoints > U+00FF I need to check the \kcatcode and branch.

@josephwright
Author

> Please note that we on the Japanese team have known for years (!) that the text handling of expl3 does not work properly for "Kanji interpretation" tokens at all (sometimes weird errors, sometimes broken text, etc.), but the fact that "explaining the complicated situation of Japanese TeX engines to you Western developers is very difficult" has kept us from reporting it. Sorry for the inconvenience...

I knew it was a bit of a hack: when the code was originally written, the aim was to avoid differences from pdfTeX. I've now got better Unicode coverage there, so this is a good time to revisit the assumptions. I think I see what's needed: I will write myself some tests.

@josephwright
Author

Turns out I can handle this without needing to test \kcatcode: testing whether the first token is >"FF avoids an error.
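One way to picture that check, as a hedged Python model rather than the actual expl3 change: a token whose character code is above 0xFF cannot be a single byte, so it can be passed through as one character, while anything at or below 0xFF is treated as the start of a UTF-8 byte sequence. The function name `take_first_char` and the representation of tokens as plain integers are assumptions made for this sketch.

```python
def take_first_char(tokens):
    """Split off the first character from a list of integer char codes.

    Returns (codepoint, kind, remaining_tokens), where kind is "kanji"
    for a single token above 0xFF and "8-bit" for a UTF-8 byte sequence.
    """
    first = tokens[0]
    if first > 0xFF:
        # Cannot be a byte: treat as one Kanji-interpretation token.
        return first, "kanji", tokens[1:]
    # Otherwise the lead byte tells us how many bytes form the character.
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    elif first >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("not a valid UTF-8 lead byte")
    char = bytes(tokens[:length]).decode("utf-8")
    return ord(char), "8-bit", tokens[length:]
```

The point of the design is that no \kcatcode lookup is needed: the size of the first value alone decides which path to take.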
