CJK sorting is based on unicode code points #259
Comments
The CSL spec doesn't seem to enforce a specific standard, although sorting by code point is probably a bad default for non-Latin-script languages. An initial search reveals that there are multiple standards for sorting CJK characters (and also for Chinese, Japanese, and Korean characters separately). Romanization is one of them, though I've also seen mentions of character-form-based sorting. I wonder if we should add some way to support those different sorting options, or if we could at least settle on an initial solution of changing the default to one of them (e.g. romanization) while ensuring you can still specify your own order (which can be done through the CSL style).
I've discussed this with Chinese colleagues, and it seems the common practice is to romanize before sorting, so I'd consider the current behaviour a bug. (I could be mistaken, so it's probably best to check with other people.) I think we can introduce Unicode sorting, along with alternative sorting options, later on if there is demand.
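To make the difference concrete, here is a minimal sketch using only the Rust standard library. It is not the project's implementation: `toy_pinyin_table`, `romanized_key`, `sort_by_code_point`, and `sort_by_romanization` are hypothetical names, and the four-entry table is a hand-written stand-in for a real romanization data source, with tones omitted.

```rust
use std::collections::HashMap;

/// Hypothetical, hand-written romanization table (assumption for this sketch).
/// A real implementation would need a complete data source.
fn toy_pinyin_table() -> HashMap<char, &'static str> {
    HashMap::from([
        ('张', "zhang"),
        ('李', "li"),
        ('王', "wang"),
        ('陈', "chen"),
    ])
}

/// Builds a sort key by replacing each known character with its romanization.
/// Characters missing from the table are lowercased and kept as they are.
fn romanized_key(s: &str, table: &HashMap<char, &'static str>) -> String {
    let mut key = String::new();
    for c in s.chars() {
        match table.get(&c) {
            Some(roman) => {
                key.push_str(roman);
                key.push(' '); // separate syllables so keys compare per character
            }
            None => key.extend(c.to_lowercase()),
        }
    }
    key
}

/// Sorts by code point: `String` ordering compares UTF-8 bytes,
/// which yields the same order as comparing code points.
fn sort_by_code_point(names: &[&str]) -> Vec<String> {
    let mut sorted: Vec<String> = names.iter().map(|s| s.to_string()).collect();
    sorted.sort();
    sorted
}

/// Sorts by the romanized key instead of the raw characters.
fn sort_by_romanization(names: &[&str]) -> Vec<String> {
    let table = toy_pinyin_table();
    let mut sorted: Vec<String> = names.iter().map(|s| s.to_string()).collect();
    sorted.sort_by_cached_key(|s| romanized_key(s, &table));
    sorted
}
```

For the input `["陈", "王", "李", "张"]`, `sort_by_code_point` returns 张, 李, 王, 陈 (U+5F20, U+674E, U+738B, U+9648), while `sort_by_romanization` returns 陈, 李, 王, 张 (chen, li, wang, zhang).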
This problem can be handled by the Unicode Collation Algorithm, and there should be several implementations in Rust, such as https://github.com/unicode-org/icu4x. The sorting methods for Chinese are defined in https://github.com/unicode-org/cldr/blob/main/common/collation/zh.xml.
When a CSL style requires author-date sorting (e.g., gb-7714-2015-author-date), characters need to be romanized before sorting. Otherwise, the default is to sort by code point.
Discord thread
EDIT: The same issue probably occurs for other non-Latin-script languages.