Make output for emphasis (`_`, `__`, `*`, or `**`) adjacent to ill-formed code unit sequence (e.g. isolated surrogate code unit) implementation-defined behavior #791

tats-u · 2025-02-24T14:23:06Z

Related: #369

The current specification appears to treat an isolated surrogate code unit (e.g. "\ud800") as non-punctuation:

The general category of every surrogate code point is Cs, which does not start with P or S.
A character in the specification is a Unicode code point, not a Unicode scalar value (scalar values exclude surrogate code points).

However, once it is replaced with U+FFFD, e.g. by String.prototype.toWellFormed, a conflict arises: U+FFFD counts as punctuation because its general category is So.
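The classification difference can be checked directly in JS. This is a minimal sketch, assuming an engine that supports Unicode property escapes in regular expressions; `isUnicodePunctuation` is a hypothetical helper name, not part of any Markdown library, and the regex replacement stands in for `toWellFormed()` so the snippet also runs on engines that lack it.

```javascript
// Unicode punctuation in the CommonMark sense: general category P* or S*.
// Hypothetical helper for illustration only.
const isUnicodePunctuation = (ch) => /^[\p{P}\p{S}]$/u.test(ch);

const loneSurrogate = "\ud800";

// With the `u` flag, \p{Cs} matches only surrogates that are not part of a pair,
// so this replacement has the same effect as toWellFormed() for this input.
const replaced = loneSurrogate.replace(/\p{Cs}/gu, "\ufffd");

console.log(/^\p{Cs}$/u.test(loneSurrogate)); // true
console.log(isUnicodePunctuation(loneSurrogate)); // false
console.log(isUnicodePunctuation(replaced)); // true
```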

Because of this conflict, implementations cannot make the input string well-formed (replace ill-formed code unit sequences with U+FFFD) in advance without changing the output.

Example:

Input: "a*\ud800*" in UTF-16 form (e.g. in JS)
An implementation performs two steps:
1. Convert the input to HTML
2. Replace isolated surrogate code units (\ud800 in the above example) with U+FFFD
If the implementation does step 1 first, it gets: a<em>&#xFFFD</em>
If the implementation does step 2 first, it gets: a*&#xFFFD*
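The difference comes from the left-flanking test applied to the first `*`. The following is a rough sketch of that test for a single-character delimiter, written from the CommonMark definitions of Unicode whitespace and Unicode punctuation; `isLeftFlanking` and its `index` parameter are hypothetical names for illustration, not taken from any existing parser.

```javascript
// Unicode whitespace and punctuation as CommonMark defines them;
// the start or end of the line counts as whitespace.
const isWhitespace = (ch) => ch === undefined || /^[\p{Zs}\t\n\f\r]$/u.test(ch);
const isPunctuation = (ch) => ch !== undefined && /^[\p{P}\p{S}]$/u.test(ch);

// Hypothetical helper: is the one-character delimiter at `index` left-flanking?
// `index` counts code points, which equals the code unit index for these inputs.
function isLeftFlanking(text, index) {
  const chars = Array.from(text); // iterates by code point; lone surrogates stay as-is
  const before = chars[index - 1];
  const after = chars[index + 1];
  if (isWhitespace(after)) return false;
  if (!isPunctuation(after)) return true;
  // Followed by punctuation: must be preceded by whitespace or punctuation.
  return isWhitespace(before) || isPunctuation(before);
}

console.log(isLeftFlanking("a*\ud800*", 1)); // true
console.log(isLeftFlanking("a*\ufffd*", 1)); // false
```

With the isolated surrogate still in place, the first `*` is left-flanking and can open emphasis. After replacement with U+FFFD it is followed by punctuation and preceded by a letter, so it cannot.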

To fix this conflict, I think the specification should state that whether ill-formed code unit sequences are treated as punctuation is implementation-defined behavior. This change would reduce the burden of writing an implementation and would not break any existing ordinary Markdown documents.

The following screenshot shows that an escape like &#xD800 in HTML is replaced with U+FFFD when rendered by Firefox:

It shows that a code point escape for an isolated surrogate in HTML (and Markdown) is ultimately equivalent to &#xFFFD.


wooorm · 2025-02-25T11:39:37Z

> &#xD800 -> U+FFFD

I think the behavior you show here has to do with https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state. See the 3rd point, on surrogates, which relates to #614.

tats-u · 2025-02-28T13:00:14Z

> I think the behavior you show here has to do with https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state.

Thank you for finding it for me. That is what I was going to look for.

> If the number is a surrogate, then this is a surrogate-character-reference parse error. Set the character reference code to 0xFFFD.

The following case is ill-formed in UTF-8 and UTF-32 as well:

> If the number is greater than 0x10FFFF, then this is a character-reference-outside-unicode-range parse error. Set the character reference code to 0xFFFD.
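The two quoted rules can be sketched as a small function. This is only an illustration: `resolveNumericReference` is a hypothetical name, and the sketch deliberately ignores every other case the HTML specification handles in that state.

```javascript
// Hypothetical helper: applies only the two rules quoted above
// (surrogates and values beyond U+10FFFF map to U+FFFD).
function resolveNumericReference(code) {
  const isSurrogate = code >= 0xd800 && code <= 0xdfff;
  if (isSurrogate || code > 0x10ffff) return 0xfffd;
  return code;
}

console.log(resolveNumericReference(0xd800).toString(16)); // fffd
console.log(resolveNumericReference(0x110000).toString(16)); // fffd
console.log(resolveNumericReference(0x41).toString(16)); // 41
```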
