Clarify where the 8-bit/Kanji split should be expected #9

josephwright · 2023-02-09T07:43:31Z

Testing the input parsing of upTeX, assuming \enablecjktoken is set, I find that codepoints U+0000 to U+0249 are handled as 8-bit characters (so I get either one or two bytes), and from U+0250 I get Kanji interpretation (a single character read). With \disablecjktoken, everything is handled as 8-bit as far as I tested.

This seems to be somewhat at odds with the description of e.g. \Uchar in the English guide, and I am not sure exactly where the behaviour is documented. I assume it's deliberate as codepoint U+0249 is not particularly notable other than falling at a convenient place to split in hexadecimal.

(This links to latex3/latex3#1171, which arises essentially because, for testing against pdfTeX, we currently set \disablecjktoken and assume 8-bit behaviour across the board for upTeX. Before making any changes, I'd like to be clear about what the expectation actually is in terms of character ranges.)

aminophen · 2023-02-09T11:31:09Z

> I find that codepoints U+0000 to U+0249 are handled as 8-bit characters (so I get either one or two bytes), and from U+0250 I get Kanji interpretation (a single character read).

Not exactly; the classification between "8-bit characters" and "Kanji interpretation" is based on the \kcatcode setting (which means that some ranges above U+0250 are also handled as "8-bit characters"). If \kcatcode is 15, the character is handled as "8-bit characters"; otherwise it gets "Kanji interpretation".
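
To see which branch applies to a given code point, a minimal plain-TeX sketch for upTeX might look like the following. The macro name \reportkcat is hypothetical and the two code points are arbitrary examples. The result depends on the \kcatcode values of the format in use, so no particular output is claimed.

```tex
% Sketch only: run with upTeX (plain format), not pdfTeX.
% \reportkcat is a hypothetical helper, not part of any package.
\def\reportkcat#1{%
  \ifnum\kcatcode#1=15
    % \kcatcode 15: bytes are read as 8-bit characters
    \message{[\number#1: 8-bit characters]}%
  \else
    % any other \kcatcode: read as one Kanji token
    \message{[\number#1: Kanji interpretation, kcatcode \the\kcatcode#1]}%
  \fi
}
\reportkcat{"0101}% a Latin Extended-A code point
\reportkcat{"3042}% HIRAGANA LETTER A
\bye
```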

\Uchar is not a upTeX primitive but an e-pTeX/e-upTeX primitive, so it is described only in eptexdoc.pdf, in Japanese. In e-pTeX and e-upTeX, \Uchar always generates a single character token with "Kanji interpretation" for character codes beyond 0xFF.

> Before any changes, I'd like to be clear what actually is the expectation in terms of character ranges.

We hope that a single character with "Kanji interpretation" is simply passed through as-is (without case folding).

aminophen · 2023-02-09T11:35:18Z

Please note that we, the Japanese team, have known for years that the text handling of expl3 does not work properly for "Kanji interpretation" tokens at all (sometimes weird errors, sometimes broken text, etc.), but the fact that "explaining the complicated situation of the Japanese TeX engines to you Western developers is very difficult" has kept us from reporting it. Sorry for the inconvenience...

josephwright · 2023-02-09T14:01:21Z

> > I find that codepoints U+0000 to U+0249 are handled as 8-bit characters (so I get either one or two bytes), and from U+0250 I get Kanji interpretation (a single character read).
>
> Not exactly; the classification between "8-bit characters" and "Kanji interpretation" is based on the \kcatcode setting (which means that some ranges above U+0250 are also handled as "8-bit characters"). If \kcatcode is 15, the character is handled as "8-bit characters"; otherwise it gets "Kanji interpretation".

OK, I think I can handle that. I guess for codepoints > U+00FF I need to check the \kcatcode and branch.

josephwright · 2023-02-09T14:02:30Z

> Please note that we, the Japanese team, have known for years that the text handling of expl3 does not work properly for "Kanji interpretation" tokens at all (sometimes weird errors, sometimes broken text, etc.), but the fact that "explaining the complicated situation of the Japanese TeX engines to you Western developers is very difficult" has kept us from reporting it. Sorry for the inconvenience...

I knew it was a bit of a hack: when the code was originally written, the aim was to avoid differences from pdfTeX. I now have better Unicode coverage there, so this is a good time to revisit the assumptions. I think I see what's needed: I will write myself some tests.

josephwright · 2023-02-13T14:19:37Z

It turns out I can handle this without needing to test \kcatcode: testing whether the first token is >"FF avoids an error.
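
As a rough illustration of that kind of check (not the actual expl3 code, which is not shown in this thread), a plain-TeX sketch for upTeX could compare the character code of the first token against "FF. The macro name \firstiskanji is hypothetical, and the sketch assumes a UTF-8 source file processed by upTeX with \enablecjktoken in effect.

```tex
% Sketch only: run with upTeX (plain format) on a UTF-8 file.
% \firstiskanji is a hypothetical helper; it looks at the first token only.
\def\firstiskanji#1#2\relax{%
  \ifnum`#1>"FF
    % character code above "FF: cannot be a single 8-bit character
    \message{[first token is above 255: Kanji token]}%
  \else
    % character code at or below "FF: an ordinary 8-bit character
    \message{[first token is at most 255: 8-bit character]}%
  \fi
}
\firstiskanji あいう\relax
\firstiskanji abc\relax
\bye
```

The comparison needs no \kcatcode lookup because an 8-bit character token can never have a character code above "FF.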
