properties: add "ambiwidth" property for ambiguous East Asian Width #270

bfredl · 2024-08-12T14:00:42Z

Some characters have their width defined as "Ambiguous" in UAX#11. These are typically rendered as single-width by modern monospace fonts, and utf8proc correctly returns charwidth==1 for these.

However, some applications might need to support older CJK fonts, where characters that were two-byte in legacy encodings were rendered as double-width. An example of this is the 'ambiwidth' option of vim and neovim, which supports rendering in terminals that use such width rules.

Add an 'ambiwidth' property to utf8proc_property_t for such characters, by using a previously unused padding bit.
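To illustrate the "previously unused padding bit" idea, here is a minimal sketch. The two structs are hypothetical, heavily reduced records (the names `demo_property_before_t`, `demo_property_after_t` and `demo_make_ambiguous` are made up for this example) and do not reproduce the real `utf8proc_property_t` layout. The point is only that turning one padding bit into a flag should not grow the record.

```c
#include <stddef.h>

/* Hypothetical, reduced property record BEFORE the change.
 * This is NOT the real utf8proc_property_t layout. */
typedef struct {
    unsigned short category;
    unsigned charwidth : 2;        /* 0, 1 or 2 columns */
    unsigned pad : 6;              /* unused padding bits */
} demo_property_before_t;

/* The same record AFTER the change: one padding bit becomes a flag. */
typedef struct {
    unsigned short category;
    unsigned charwidth : 2;
    unsigned ambiguous_width : 1;  /* 1 if East Asian Width is "Ambiguous" */
    unsigned pad : 5;              /* one padding bit fewer */
} demo_property_after_t;

/* Compile-time check: claiming a padding bit keeps the record size. */
_Static_assert(sizeof(demo_property_before_t) == sizeof(demo_property_after_t),
               "adding the flag must not grow the property record");

/* Build a record for an ambiguous-width character: charwidth stays 1,
 * and the new flag carries the extra information. */
demo_property_after_t demo_make_ambiguous(void) {
    demo_property_after_t p = {0};
    p.charwidth = 1;
    p.ambiguous_width = 1;
    return p;
}
```

Existing consumers that only read `charwidth` would keep seeing 1 for these characters; only callers that opt in look at the new flag.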

Alternatives:

- Set charwidth==3, a value that is presently unused, for such characters (those that are not zero-width). I think this would be too much of a breaking change for existing consumers.
- Return the full set of EAW classes (W, F, N, H, Na, A). This could be more future-proof if some consumers need this info, but would require more space.

stevengj · 2024-08-12T19:30:15Z

> older CJK fonts where two-byte characters in legacy encodings were rendered as double-width

If this is font-dependent, it doesn't seem like something you can infer from codepoint alone?

I'm a little confused about how people would use this new property in practice.

bfredl · 2024-08-12T20:33:20Z

Sure, but it is font-dependent either way. Currently utf8proc represents all (non-zero-width) ambiguous-width chars as single-width, which is a fine first approximation but not guaranteed to be correct either. Knowing which chars are considered ambiguous allows apps to treat them more carefully, e.g. in a TUI you could reposition the cursor after each such codepoint to make sure the TUI and terminal emulator cursors are in sync regardless of the actual width in the user's font.
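The cursor-resync workaround could look roughly like the sketch below. Everything here is an assumption made for illustration: `demo_is_ambiguous` is a tiny hard-coded stand-in for a real table lookup, every other code point is treated as one cell wide, and the function names are hypothetical. The escape used is CHA (`CSI n G`, cursor to absolute column), emitted after each ambiguous code point so the terminal cursor ends up where the application believes it is.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a real lookup: only U+00A1 and the Greek
 * lowercase range U+03B1..U+03C1 are treated as ambiguous here. */
static int demo_is_ambiguous(unsigned cp) {
    return cp == 0x00A1 || (cp >= 0x03B1 && cp <= 0x03C1);
}

/* Encode one valid Unicode scalar value as UTF-8; returns bytes written. */
static size_t demo_encode_utf8(unsigned cp, char *out) {
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Write `n` code points into `out`, starting at 1-based column `start_col`.
 * After each ambiguous code point, emit CHA to force the terminal cursor to
 * the column the application assumes (ambi_cells is 1 or 2).
 * Simplification: every non-ambiguous code point counts as one cell.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t demo_render_row(const unsigned *cps, size_t n, int start_col,
                       int ambi_cells, char *out, size_t cap) {
    size_t len = 0;
    int col = start_col;
    for (size_t i = 0; i < n; i++) {
        if (cap - len < 20) return 0;  /* room for one char plus one escape */
        len += demo_encode_utf8(cps[i], out + len);
        if (demo_is_ambiguous(cps[i])) {
            col += ambi_cells;
            len += (size_t)snprintf(out + len, cap - len, "\x1b[%dG", col);
        } else {
            col += 1;
        }
    }
    if (len >= cap) return 0;
    out[len] = '\0';
    return len;
}
```

With this approach the configured width only decides where the next cell goes; the explicit reposition absorbs any disagreement with the user's font.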

More specifically, this was motivated by ongoing work in neovim to migrate all Unicode table lookups to utf8proc, and ambiguous EAW is something we need to know in order not to regress functionality. Whether these chars are treated as single- or double-width is configurable as an option, and regardless we apply the workaround described above to handle discrepancies in fonts.

bfredl · 2024-08-14T08:56:39Z

This is an example of how this property will be used in neovim: neovim/neovim#30042.

clason · 2024-08-29T08:59:16Z

@stevengj any input? This is a bit of a blocker for us.

stevengj · 2024-08-29T13:41:21Z

Seems fine to me; can you add an accessor function to the API? e.g. utf8proc_charwidth_ambiguous
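A rough sketch of what such an accessor, and a consumer with an 'ambiwidth'-style option, might look like. All `demo_*` names and the tiny hard-coded lookup are assumptions made so the example is self-contained; they stand in for the library's property table and are not utf8proc's actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced property record (not the real utf8proc layout). */
typedef struct {
    unsigned charwidth : 2;
    unsigned ambiguous_width : 1;
} demo_property_t;

/* Stand-in for the real table lookup, covering only a few ranges:
 * U+00A1 and U+03B1..U+03C1 are East Asian Width "A" (Ambiguous),
 * U+4E00..U+9FFF (CJK Unified Ideographs) are "W" (Wide). */
static demo_property_t demo_get_property(int32_t cp) {
    demo_property_t p = {1, 0};
    if (cp == 0x00A1 || (cp >= 0x03B1 && cp <= 0x03C1)) {
        p.ambiguous_width = 1;
    } else if (cp >= 0x4E00 && cp <= 0x9FFF) {
        p.charwidth = 2;
    }
    return p;
}

/* Accessor in the spirit of the suggested utf8proc_charwidth_ambiguous:
 * true if the code point's width is classified as ambiguous. */
bool demo_charwidth_ambiguous(int32_t cp) {
    return demo_get_property(cp).ambiguous_width != 0;
}

/* Consumer-side policy: ambiguous characters take two cells only when the
 * caller opts in; everything else uses the plain charwidth. */
int demo_cell_width(int32_t cp, bool ambi_double) {
    demo_property_t p = demo_get_property(cp);
    if (ambi_double && p.ambiguous_width) {
        return 2;
    }
    return (int)p.charwidth;
}
```

Keeping the policy in the consumer means the library's default width answer stays unchanged, and only applications that care about legacy CJK rendering pay attention to the flag.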

…idth Some characters have their width defined as "Ambiguous" in UAX#11. These are typically rendered as single-width by modern monospace fonts, and utf8proc correctly returns charwidth==1 for these. However, some applications might need to support older CJK fonts where characters which were two-byte in legacy encodings were rendered as double-width. An example of this is the 'ambiwidth' option of vim and neovim, which supports rendering in terminals using such width rules. Add an 'ambiguous_width' property to utf8proc_property_t for such characters.

bfredl · 2024-08-30T08:07:39Z

done.

stevengj · 2024-08-30T16:41:42Z

Note that Unicode 16 looks like it is scheduled to be released on September 10, so it might be good to hold off on a new release for a couple of weeks until we can update the Unicode tables.

ZerdoX-x · 2024-12-08T16:57:38Z

3 months ping. could we release? 👀 @stevengj

stevengj added the enhancement label Aug 12, 2024

bfredl mentioned this pull request Aug 14, 2024

refactor(multibyte): replace generated unicode tables with utf8proc neovim/neovim#30042

Merged

bfredl force-pushed the ambiwidth branch from 0e89fc9 to 8c97229 Compare August 30, 2024 08:06

stevengj merged commit 3de4596 into JuliaStrings:master Aug 30, 2024
12 checks passed

clason mentioned this pull request Sep 9, 2024

Please make a new release with commit 3de4596 #272

Open
