Make output for emphasis (_, __, *, or **) adjacent to ill-formed code unit sequence (e.g. isolated surrogate code unit) implementation-defined behavior #791

Open
tats-u opened this issue Feb 24, 2025 · 2 comments

Comments

@tats-u

tats-u commented Feb 24, 2025

Related: #369

micromark/micromark#190 (comment)

As I read it, the current specification treats an isolated surrogate code unit (e.g. "\ud800") as non-punctuation:

  • The general category of every surrogate code point is Cs, which does not start with P or S.
  • A character in the specification is a Unicode code point, not a Unicode scalar value (scalar values exclude surrogate code points).
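
A quick way to check the two points above in JS is with Unicode property escapes. This is only a sketch: `isPunctuationLike` is a hypothetical helper name, and it assumes "punctuation" means a general category starting with P or S, as described above.

```javascript
// Hypothetical helper: true when `ch` (a single code point) has a general
// category starting with P or S. With the `u` flag, a lone surrogate code
// unit is matched as a single code point.
function isPunctuationLike(ch) {
  return /^[\p{P}\p{S}]$/u.test(ch);
}

const loneSurrogate = "\ud800"; // isolated high surrogate
const replacement = "\ufffd";   // U+FFFD REPLACEMENT CHARACTER

console.log(/^\p{Cs}$/u.test(loneSurrogate)); // true: general category Cs
console.log(isPunctuationLike(loneSurrogate)); // false
console.log(/^\p{So}$/u.test(replacement));   // true: general category So
console.log(isPunctuationLike(replacement));   // true
```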

However, once it is replaced with U+FFFD, e.g. by String.prototype.toWellFormed, a conflict occurs: U+FFFD counts as punctuation because its general category is So.

Because of this conflict, an implementation cannot make the input string well-formed in advance (that is, replace ill-formed code unit sequences with U+FFFD before parsing) without changing the result.

Example:

  • Input: "a*\ud800*" in UTF-16 form (e.g. in JS)

  • An implementation performs two steps:

    1. Convert the input to HTML
    2. Replace isolated surrogate code units (\ud800 in the above example) with U+FFFD
  • If the implementation performs step 1 first, it gets: <p>a<em>&#xFFFD</em></p>.

  • If the implementation performs step 2 first, it gets: <p>a*&#xFFFD*</p>.
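
The order dependence in the example above can be reproduced with a rough sketch of the left-flanking check for the opening `*`. All helper names here are hypothetical and this is nowhere near a full emphasis parser. It only asks whether the first `*` could open emphasis, before and after the lone surrogate is replaced.

```javascript
// Hypothetical helpers for illustration only; real parsers do much more.
// Start and end of input count as whitespace.
const isWhitespace = (ch) => ch === undefined || /^[\p{Zs}\t\n\f\r]$/u.test(ch);
const isPunctuation = (ch) => ch !== undefined && /^[\p{P}\p{S}]$/u.test(ch);

// Left-flanking test for a one-character delimiter run at `index`,
// where `chars` is the input split into code points.
function isLeftFlanking(chars, index) {
  const before = chars[index - 1];
  const after = chars[index + 1];
  if (isWhitespace(after)) return false;
  if (!isPunctuation(after)) return true;
  return isWhitespace(before) || isPunctuation(before);
}

// Replace lone surrogates with U+FFFD. Falls back to a regex on runtimes
// without String.prototype.toWellFormed (with the `u` flag, \p{Cs} only
// matches unpaired surrogates).
function makeWellFormed(s) {
  return typeof s.toWellFormed === "function"
    ? s.toWellFormed()
    : s.replace(/\p{Cs}/gu, "\ufffd");
}

const input = "a*\ud800*";
const raw = Array.from(input);                   // parse first
const fixed = Array.from(makeWellFormed(input)); // replace first

const canOpenRaw = isLeftFlanking(raw, 1);
const canOpenFixed = isLeftFlanking(fixed, 1);
console.log(canOpenRaw, canOpenFixed); // true false
```

With the raw input the `*` is followed by a non-punctuation code point, so it can open emphasis. After replacement it is followed by U+FFFD (punctuation) and preceded by a letter, so it cannot.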

To resolve this conflict, I think the specification should state that whether ill-formed code unit sequences are treated as punctuation is implementation-defined. This change would reduce the burden on implementers and would not break any existing well-formed Markdown document.

The following screenshot shows that an escape like &#xD800 in HTML is replaced with U+FFFD when rendered by Firefox:

Image

It shows that a numeric character reference to an isolated surrogate in HTML (and Markdown) is ultimately equivalent to &#xFFFD.

@wooorm
Contributor

wooorm commented Feb 25, 2025

&#xD800 -> U+FFFD

I think the behavior you show here has to do with https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state. See the third point there, on surrogates, which relates to #614.

@tats-u
Author

tats-u commented Feb 28, 2025

I think the behavior you show here has to do with https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state.

Thank you for finding that for me. It is exactly what I was going to look for.

If the number is a surrogate, then this is a surrogate-character-reference parse error. Set the character reference code to 0xFFFD.

The following case is ill-formed in UTF-8 and UTF-32 as well:

If the number is greater than 0x10FFFF, then this is a character-reference-outside-unicode-range parse error. Set the character reference code to 0xFFFD.
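
The two quoted rules could be sketched like this. `resolveCharacterReferenceCode` is a hypothetical name, and the sketch covers only these two cases, not the rest of the numeric character reference end state (null, noncharacters, controls and so on).

```javascript
// Hypothetical helper covering only the two rules quoted above.
function resolveCharacterReferenceCode(code) {
  // Surrogate: surrogate-character-reference parse error.
  if (code >= 0xd800 && code <= 0xdfff) return 0xfffd;
  // Beyond U+10FFFF: character-reference-outside-unicode-range parse error.
  if (code > 0x10ffff) return 0xfffd;
  return code;
}

const fromSurrogate = String.fromCodePoint(resolveCharacterReferenceCode(0xd800));
const fromTooLarge = String.fromCodePoint(resolveCharacterReferenceCode(0x110000));
const fromLetter = String.fromCodePoint(resolveCharacterReferenceCode(0x61));

console.log(fromSurrogate === "\ufffd", fromTooLarge === "\ufffd", fromLetter); // true true a
```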
