Support Japanese numerals #228

Closed
4 of 5 tasks
Intelligent2013 opened this issue Oct 20, 2024 · 39 comments
Labels
enhancement New feature or request

@Intelligent2013
Contributor

Intelligent2013 commented Oct 20, 2024

Source issue: #226

Support Japanese numerals in

  • clause numbers
    Example:
    image

  • ordered list items
    Example:
    image

  • edition number
    currently, there are two elements in the Presentation XML:

<edition language="">1</edition>
<edition language="ja">第1版</edition>
  • publication date
    Example: 令和元年七月二十二日
    Current Presentation XML: <date type="published">令和元年7月22日</date>
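
For illustration, the date conversion requested above could be sketched in Ruby roughly as follows. This is a minimal sketch, not Metanorma code: `kanji_under_100` and `japanese_date` are hypothetical helper names, and only values from 1 to 99 are converted, on the assumption that era years, months and days stay within that range.

```ruby
KANJI = %w[〇 一 二 三 四 五 六 七 八 九].freeze

# Convert an integer in 1..99 to kanji numerals (hypothetical helper).
def kanji_under_100(number)
  tens, ones = number.divmod(10)
  out = +""
  # A leading "one" is not written before 十 (10 is 十, not 一十).
  out << KANJI[tens] if tens > 1
  out << "十" if tens.positive?
  out << KANJI[ones] if ones.positive?
  out
end

# Replace every run of ASCII digits in 1..99 with kanji numerals;
# anything outside that range is left untouched.
def japanese_date(text)
  text.gsub(/\d+/) do |digits|
    value = digits.to_i
    (1..99).cover?(value) ? kanji_under_100(value) : digits
  end
end

puts japanese_date("令和元年7月22日")
```

With the sample date from this issue, the sketch yields the form shown in the example above.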

If this task is complicated, then I'll find how to do this via XSLT extensions on Java.

@ronaldtse do we need to support two number formats: Arabic (1, 2, 3, ...) for usual documents and Japanese (一, ...) for vertical layout documents? Or only Japanese numbers?

Note: I don't know the reason, but the notes numbers should be Arabic:
image

UPDATE after the comment

@Intelligent2013 I just noticed this since @opoudjis raised it. They are meant to be in Japanese numerals too.

  • notes, examples numbers
@ReesePlews

very interesting to see the vertical layout. thanks for all the work on this @Intelligent2013 ! i don't work with vertical layout much but the third image above looks more correct than the second image. the layout of the kanji numbers in the first image appears correct for the main clause numbers, but with the sub-clause numbering, the vertical style of '三・一' etc seems different to me... i guess, in theory, that is the correct style but seems a bit difficult on the eyes; again i don't have enough experience with vertical layout. i suspect that vertical layout is widely used by such agencies as the justice ministry (法務省) and the writing of japanese laws/regulations. i know there is a large legal website that has japanese laws with english translations, but offhand i don't remember the link. they may have samples of printed works online that could be helpful in these cases.

@ronaldtse

ronaldtse commented Oct 21, 2024

Thank you @ReesePlews ! Yes you are right that the Japanese "e-Gov" website has all the Japanese laws.

For example, this is the Constitution of Japan:

For vertical layout, they have 3 options: 1 column, 2 columns and 4 columns
Screenshot 2024-10-21 at 10 02 13 AM

This is the law that establishes JIS:

For space savings, this is a screenshot of the 4 column (so it's not too tall to show here).
Screenshot 2024-10-21 at 10 08 00 AM

It uses the list style:

  • 1, 2...
  • 一, 二, ...
  • イ, ロ, ...
  • (1), (2)...
  • (i), (ii)...

The list style only uses a single full width space indentation to separate list levels.

UPDATE: It seems that when Paragraphs are labeled, in the e-Gov website the paragraph label for the first paragraph is omitted, and subsequent paragraph labels exist. Not sure why the list item "1" is missing though. This doesn't seem to be an East Asian tradition.

@Intelligent2013
Contributor Author

The 1st post updated - added 'edition number'.

@opoudjis
Contributor

opoudjis commented Oct 22, 2024

There are two elements to this.

The first is to support Japanese numerals, and I can do that, sure: that's merely 2.localize(:ja).spellout, using twitter_cldr.
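
As a rough, self-contained illustration of what such a spellout produces, a hypothetical `kanji_spellout` helper might look like the following. It is a sketch only, not the twitter_cldr implementation (whose rules are more complete); the 0..9999 range and the use of 〇 for zero are assumptions made for the example.

```ruby
KANJI_DIGITS = %w[〇 一 二 三 四 五 六 七 八 九].freeze
KANJI_UNITS = { 1000 => "千", 100 => "百", 10 => "十" }.freeze

# Spell out an integer from 0 to 9999 in kanji numerals (hypothetical helper).
def kanji_spellout(number)
  raise ArgumentError, "supported range is 0..9999" unless (0..9999).cover?(number)
  return KANJI_DIGITS[0] if number.zero?

  result = +""
  remainder = number
  KANJI_UNITS.each do |value, glyph|
    digit, remainder = remainder.divmod(value)
    next if digit.zero?

    # A leading "one" is dropped before 十, 百 and 千 (十, not 一十).
    result << KANJI_DIGITS[digit] if digit > 1
    result << glyph
  end
  result << KANJI_DIGITS[remainder] if remainder.positive?
  result
end

puts kanji_spellout(2)
puts kanji_spellout(22)
```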

The second is to work out where to use Japanese numerals instead of Arabic numerals. This should not be done on an ad hoc basis, and it should not be done independently in HTML and PDF: there needs to be a rule as to where it happens, and it needs to be done in Presentation XML.

I have the bad feeling that this is going to end up as a document attribute.

@ronaldtse

ronaldtse commented Oct 22, 2024

> I have the bad feeling that this is going to end up as a document attribute.

You mean the specification of list bullet styles per level being configurable? I'd (everyone would) love that.

@opoudjis
Contributor

I don't even know if I can do that in HTML. Not without a lot of pain.

And you need to say a lot more about where Japanese numbers are meant to show up. Numbering is done in code; I can make the xref counter output Japanese instead of Arabic numerals, but that means initialising each counter instance in isodoc, one for every block type and clause (figures, tables, requirements, etc etc etc).

Without a coherent statement, you are not getting anything.

@ronaldtse

> Note: I don't know the reason, but the notes numbers should be Arabic:

@Intelligent2013 I just noticed this since @opoudjis raised it. They are meant to be in Japanese numerals too.

@opoudjis
Contributor

opoudjis commented Oct 22, 2024

> You mean the specification of list bullet styles per level being configurable? I'd (everyone would) love that.

PER LEVEL?! No you are not getting random list level specification PER LEVEL. ISO HTML CSS has 30 lines of custom code just to insert ")" after list numbers.
metanorma/isodoc#247 has been unactioned for the past four years because of how horrible Word HTML is about custom list numbering.

No, what you're going to get is:

  • A document attribute specifying whether Japanese or Arabic auto-numbering is to be used in the document. I am not going to be supporting vague notions of new flavours or document types: I am yet to see evidence that there is a coherent mapping of Japanese numbering to document type or organisation at all, and I'm not going to wait for one.
  • Restriction of Japanese number styling to clauses, ordered lists, and edition numbers. Each and every numbering counter is a separate variable, and if any one of them outputs Arabic, they need to be set individually. I am not at this time going to assume that Japanese numbering is used for all autonumbering in the document, for the simple reason that the sample document does not, and it is not our place to dictate to people what numbers they use universally.

Ordered lists will rely on the Presentation XML feature of //ol/li/@label to tell the consumer what to put in the list. This will only work out of the box for PDF, and there is code from other flavours that can make it work for DOC; HTML would need CSS overriding to make it work.
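
To illustrate what a consumer of `//ol/li/@label` might do, here is a minimal Ruby sketch using REXML. The XML fragment is a hypothetical, simplified stand-in (real Presentation XML has namespaces and more markup), and `list_labels` is an illustrative name, not part of isodoc.

```ruby
require "rexml/document"

# Hypothetical, simplified fragment of an ordered list with precomputed labels.
SAMPLE = <<~XML
  <ol>
    <li label="一"><p>First item</p></li>
    <li label="二"><p>Second item</p></li>
  </ol>
XML

# Return [label, text] pairs for each list item, taking the label from
# the @label attribute rather than recomputing the numbering.
def list_labels(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//ol/li").map do |li|
    [li.attributes["label"], li.elements["p"].text]
  end
end

list_labels(SAMPLE).each { |label, text| puts "#{label} #{text}" }
```

The point of the design is that the renderer prints whatever label it is given, so numbering style is decided once, in Presentation XML.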

I am considering this nothing more than a proof of concept.

@opoudjis
Contributor

opoudjis commented Oct 22, 2024

I'm going to realise this with the document attribute

:presentation-metadata-japanese-numbering: true

@opoudjis
Contributor

opoudjis commented Oct 22, 2024

@ronaldtse wants to generalise this to Arabic, Chinese, and Amharic.

I have little inclination to do so, and this does not address the very real problem of what types of block are going to be Arabic and what local.

But:

:presentation-metadata-autonumbering-style: japanese

The nightmare scenario is:

:presentation-metadata-notes-autonumbering-style: arabic
:presentation-metadata-clause-autonumbering-style: japanese
:presentation-metadata-subclause-autonumbering-style: arabic

I will not be implementing that.

@opoudjis
Contributor

To make counters more configurable, I'm going to eventually set up configuration of all counters—starting value and style. But for now, I'm only going to expose that for clauses and lists.

@opoudjis
Contributor

opoudjis commented Oct 22, 2024

I've got a problem: I want to assign config to counter classes based on config in the xref class (which knows about numbering styles from the Presentation XML metadata), but I don't want to redefine all the classes invoking them.

So to exploit inheritance, I'm going to have to define these counter classes with methods invoked from the xref class.
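
One possible reading of that design, sketched in Ruby under stated assumptions: the class and method names below (`Counter`, `Xref`, `clause_counter`, `note_counter`) are hypothetical and are not the isodoc API. The xref object holds the numbering style and hands out configured counters, so callers never pass the style themselves.

```ruby
# Hypothetical names throughout; this is not the isodoc counter API.
class Counter
  JAPANESE = %w[一 二 三 四 五 六 七 八 九 十].freeze

  STYLES = {
    arabic: ->(n) { n.to_s },
    # Only 1..10 are mapped in this sketch; other values fall back to Arabic.
    japanese: ->(n) { n.between?(1, 10) ? JAPANESE[n - 1] : n.to_s },
  }.freeze

  # style: numbering style key; start: the first value the counter will emit.
  def initialize(style: :arabic, start: 1)
    @formatter = STYLES.fetch(style)
    @value = start - 1
  end

  def increment
    @value += 1
    self
  end

  def label
    @formatter.call(@value)
  end
end

class Xref
  def initialize(numbering_style)
    @numbering_style = numbering_style
  end

  # Clause counters follow the document-wide style.
  def clause_counter(start: 1)
    Counter.new(style: @numbering_style, start: start)
  end

  # Note counters stay Arabic in this sketch.
  def note_counter
    Counter.new(style: :arabic)
  end
end

clauses = Xref.new(:japanese).clause_counter
puts clauses.increment.label
```

A flavour-specific subclass of the xref class could then override a single factory method instead of redefining every class that creates a counter.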

opoudjis added a commit to metanorma/isodoc that referenced this issue Oct 22, 2024
opoudjis added a commit to metanorma/isodoc that referenced this issue Oct 22, 2024
opoudjis added a commit that referenced this issue Oct 22, 2024
@opoudjis
Contributor

Not working yet...

@Intelligent2013
Contributor Author

Also we need to support Japanese numerals in the publication date. I've updated the initial post.

@opoudjis
Contributor

opoudjis commented Oct 23, 2024

I am providing Japanese numbering in the Presentation XML, but there is a nightmare scenario where you provide Japanese numbering for page numbers. If you do need them, and if XSL-FO is not clever enough to do that automatically, I'll need to dump the numbers 1–1,000 in the localization strings. Let's not action that yet though... I'd be surprised if XSL-FO doesn't provide that natively somewhere.

@Intelligent2013
Copy link
Contributor Author

> I am providing Japanese numbering in the Presentation XML, but there is a nightmare scenario where you provide Japanese numbering for page numbers. If you do need them, and if XSL-FO is not clever enough to do that automatically, I'll need to dump the numbers 1–1,000 in the localization strings. Let's not action that yet though... I'd be surprised if XSL-FO doesn't provide that natively somewhere.

@opoudjis Apache FOP has the extension fox:number-conversion-features (https://xmlgraphics.apache.org/fop/2.0/complexscripts.html#source), but it looks like it's not working at all; maybe I'm doing something wrong... In any case, let's dump the numbers 1–1,000 into the localization strings when you have time. The page number changes should be applied in the IF (Intermediate Format) after XSL-FO generation.

@opoudjis
Contributor

We need to localise the clause number delimiter, from half-width to full-width full stop, if Japanese numbering is used.

And I'm going to use this as the opportunity to implement a fix to CJK punctuation called on in relaton/relaton-render#52, which I have not implemented to date because of @ronaldtse ’s indefensible notion that

Johnson、 A。、 Peters、 B。 1976。 The origins of sound 【series】。 London〯Blackwells

is desirable punctuation.

It is not, I reject with utmost vehemence any claim that it is (and so has Reese) and I am pressing ahead with the correct solution.

Regardless of the document's main language, punctuation localisation will convert punctuation from half-width to full-width only if the characters on both sides are CJK.

So:

  • All clause numbers will now be subject to punctuation localisation.
  • Regardless of the language of the document, a clause number like "2.1" will ABSOLUTELY NOT be converted to "2。1", because that is insane, and makes me look incompetent.
  • The clause number "二.一" will however be converted to "二。一", because the dot is surrounded by CJK characters.
  • Annex number "A.一" will not be converted to "A。一"
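
The rule in the list above can be sketched in Ruby with lookaround assertions. This is a minimal sketch, not the isodoc implementation: `localize_full_stop` is a hypothetical name, and treating Han, Hiragana and Katakana as "CJK" is an assumption made for the example.

```ruby
# A half-width full stop with a CJK character immediately on both sides.
CJK_CONTEXT_STOP =
  /(?<=[\p{Han}\p{Hiragana}\p{Katakana}])\.(?=[\p{Han}\p{Hiragana}\p{Katakana}])/

# Convert "." to the given full-width replacement only when both
# neighbouring characters are CJK; everything else is left unchanged.
def localize_full_stop(text, replacement: "。")
  text.gsub(CJK_CONTEXT_STOP, replacement)
end

["二.一", "2.1", "A.一"].each do |number|
  puts "#{number} => #{localize_full_stop(number)}"
end
```

Because the lookarounds consume no characters, consecutive delimiters such as those in "二.一.三" are each tested against their own neighbours. The `replacement` keyword leaves room for a different delimiter if the full stop turns out not to be the right target character.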

I am also going to bite the bullet and move Japanese number rendering to isodoc for xref counters; they already support Roman at top level.

@opoudjis
Contributor

@Intelligent2013 The edition numbering works in testing, so I will need to investigate that. The list numbering will also be complicated.

@opoudjis
Contributor

opoudjis commented Oct 24, 2024

Reese, the point of what I have written is the following:

  • Automated text generation in Metanorma uses Latin punctuation
  • Latin punctuation in CJK text needs to be switched to full-width punctuation, if it is automated text
  • But not if the Latin punctuation is adjacent to Latin text
  • If users actually want CJK punctuation inside Latin text (which Ronald seems to think they do), then it needs to be set as such at the outset: CJK punctuation will not be converted back to Latin
  • My use of "Code" is a random example. Try, more to the point:

二.二 => 二。二 (although it looks like I will need to override this with middle-dot anyway)
A.2 => A.2 (unchanged; previously it would have attempted A。2)

@ronaldtse

@opoudjis the Japanese "middle dot" delimiter is not the "full stop", they are different symbols.

@ronaldtse

ronaldtse commented Oct 24, 2024

> If users actually want CJK punctuation inside Latin text (which Ronald seems to think they do), then it needs to be set as such at the outset: CJK punctuation will not be converted back to Latin

No, that's not what I asked for. The default for bibliographic entries is to be rendered in a suitable style, i.e. English in English, Japanese in Japanese. We could have Japanese in English or English in Japanese but that should not be the default.

@opoudjis
Contributor

opoudjis commented Oct 24, 2024

Bibliographic entries will routinely be mixed-language, with things like Japanese authors and English titles. The notion of a bibliographic entry being "just Japanese" or "just English" is naive and inflexible. It is also a nuisance on top of trying to work out what the language of a bibliographic entry is to begin with. (You think users are going to be marking it up as [lang=ja]? And then mark up titles individually as exceptions? When we can work out the script automatically through Regex?)

That's why working out whether to apply CJK punctuation contextually, rather than based solely on a language tag, has ALWAYS been the right way to proceed, and I am proceeding with it.

Rereading, the default is indeed going to be CJK, but it will be overridden when the immediate context shows that full-width punctuation makes no sense (the surrounding characters are Latin). And I simply cannot trust users to exhaustively mark up references (let alone individual bits of references) to indicate language explicitly.

@opoudjis
Contributor

> @opoudjis the Japanese "middle dot" delimiter is not the "full stop", they are different symbols.

As I have just acknowledged, which is why I am doing the refactoring.

opoudjis added a commit to metanorma/isodoc that referenced this issue Oct 24, 2024
@opoudjis
Contributor

> From a.presentation.xml: <edition language="">1</edition><edition language="ja">第1版</edition>

You're looking at the wrong file: I am generating

<edition language="">1</edition><edition language="ja">第一版</edition>

in the Japanese numbering version. You'll have a refresh soon.

opoudjis added a commit that referenced this issue Oct 24, 2024
@opoudjis
Contributor

> ordered list items

This is an update to JIS. JIS has Alphabetic numbering on its first level of ordered lists, and Arabic numbering on subsequent levels. I don't know what the provenance of the PDF sample is, and I do not care: I am not overriding JIS list numbering for some unasked-for proof of concept. I am implementing Japanese numbering to replace Arabic numbering in ordered lists ONLY where JIS sanctions that.

opoudjis added a commit to metanorma/isodoc that referenced this issue Oct 24, 2024
opoudjis added a commit to metanorma/isodoc that referenced this issue Oct 24, 2024
opoudjis added a commit to metanorma/isodoc that referenced this issue Oct 24, 2024
opoudjis added a commit to metanorma/isodoc that referenced this issue Oct 24, 2024
@opoudjis
Contributor

As warned: HTML right now has no idea what to do with custom list labels.

@Intelligent2013 The following should now have everything you need for this proof of concept.

Archive.zip

@Intelligent2013
Contributor Author

> You're looking at the wrong file: I am generating
>
> <edition language="">1</edition><edition language="ja">第一版</edition>
>
> in the Japanese numbering version. You'll have a refresh soon.

OK. Please note that I need just the number, without the text around it. And we need to keep the value 第1版 for the current (non-vertical) layout.
I.e. like this <edition language="">1</edition><edition language="ja">第1版</edition><edition language="ja" numberonly="true">一</edition>.

@opoudjis
Contributor

Yuck, that's really ad hoc. OK...

opoudjis added a commit that referenced this issue Oct 24, 2024
@opoudjis
Contributor

@Intelligent2013 Here you go.

Archive 2.zip

@Intelligent2013
Contributor Author

Ordered lists look ok:
image

Thanks!

Now, testing edition number....

@Intelligent2013
Contributor Author

@opoudjis the edition number is ok also. Thanks!

I've updated the initial post for notes, examples numbers:

> Note: I don't know the reason, but the notes numbers should be Arabic:
>
> @Intelligent2013 I just noticed this since @opoudjis raised it. They are meant to be in Japanese numerals too.

@opoudjis
Contributor

> @opoudjis the edition number is ok also. Thanks!
>
> I've updated the initial post for notes, examples numbers:
>
> Note: I don't know the reason, but the notes numbers should be Arabic:
>
> @Intelligent2013 I just noticed this since @opoudjis raised it. They are meant to be in Japanese numerals too.

I will not be actioning this at this time, because I need evidence that clients actually want this behaviour, and I am reasonably sure they won't be consistent about it.

opoudjis moved this from 🏗 In progress to 👀 In review in Metanorma Oct 25, 2024
@opoudjis
Contributor

So, rather than get into a protracted discussion:

I am closing this ticket as complete.

The additional requirement stated for custom numbering of notes, examples, requirements, formulas, term notes, term examples, annexes, admonitions, ordered lists (as distinct from list items), definition lists, figures, subfigures, tables, could be satisfied in one of two ways:

  • Blanket Japanese numbering of all of them. That is not what the PDF document is doing, and without a written statement from a Japanese client saying that is what they want, I will not implement it, and the assumption that we can ignore what agencies have actually done in their editorial practice is unacceptable.
  • Customisation of all fourteen classes of counter, because we never know which preference any particular agency is going to go with, and we have no reason to think there is any consistency between them.

The second approach is the only respectful way to engage with customers. It is also 200-300 lines of code for what is, at this stage, a proof of concept that nobody external has actually asked for, and that no external agency is exercising QA over.

It is therefore not going to be a priority for me to work on until some agency actually does ask for it, and can articulate authoritatively whether they want each of their notes, examples, requirements, formulas, term notes, term examples, annexes, admonitions, ordered lists (as distinct from list items), definition lists, figures, subfigures, tables to be numbered Japanese or Arabic.

I will create a ticket for this, and I will demote it to medium priority.

Projects
Status: Done
4 participants