Make output for emphasis (_, __, *, or **) adjacent to ill-formed code unit sequence (e.g. isolated surrogate code unit) implementation-defined behavior
#791
Open
tats-u opened this issue
Feb 24, 2025
· 2 comments
Related: #369
micromark/micromark#190 (comment)
The current specification is thought to treat an isolated surrogate code unit (e.g. `"\ud800"`) as non-punctuation: its general category is Cs, which does not start with P or S. However, once it is replaced with U+FFFD by e.g. `String.prototype.toWellFormed`, a conflict occurs: U+FFFD is punctuation because its general category is So.

Due to this conflict, implementations cannot make the input string well-formed (replace invalid code unit sequences with U+FFFD) in advance.
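The category mismatch can be checked directly in JS. This is a minimal sketch, assuming that "punctuation" means general category P or S as described above; `isPunctuationLike` and `replaceLoneSurrogates` are hypothetical helper names, and the regex fallback is only there for engines that lack `String.prototype.toWellFormed`.

```javascript
// Hypothetical helper: true when a single code point has general category
// P* or S* (assumption: this is what the emphasis rules call punctuation).
function isPunctuationLike(ch) {
  return /^[\p{P}\p{S}]$/u.test(ch);
}

// Hypothetical helper: replace lone surrogates with U+FFFD.
// Uses the built-in when present. In the fallback, the `u` flag makes the
// regex work on code points, so a valid surrogate pair is one non-Cs code
// point and only lone surrogates match \p{Cs}.
function replaceLoneSurrogates(s) {
  return typeof s.toWellFormed === "function"
    ? s.toWellFormed()
    : s.replace(/\p{Cs}/gu, "\uFFFD");
}

const lone = "\ud800";
const replaced = replaceLoneSurrogates(lone);

console.log(isPunctuationLike(lone));     // false (category Cs)
console.log(isPunctuationLike(replaced)); // true  (U+FFFD, category So)
```

The same code unit therefore changes class depending on whether the replacement has already happened.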
Example:

Input is: `"a*\ud800*"` in the UTF-16 form (e.g. JS)

An implementation will do:

i. Convert the input to HTML
ii. Replace isolated surrogate code units (`\ud800` for the above example) with U+FFFD

If the implementation does i. first, it will get: `<p>a<em>�</em></p>`.

If the implementation does ii. first, it will get: `<p>a*�*</p>`.

To fix this conflict, I think the specification should state that whether ill-formed code unit sequences are treated as punctuation is implementation-defined behavior. This change would reduce the burden of creating an implementation and would not break any existing normal Markdown documents.
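The effect of the two orderings can be reproduced with a toy version of the flanking rules. This is only a sketch under stated assumptions, not a Markdown parser: it handles a single `*` opener and a single `*` closer, treats categories P and S as punctuation, and every function name here is hypothetical.

```javascript
// Start/end of line (undefined) counts as whitespace for flanking purposes.
const isWhitespace = (ch) => ch === undefined || /^[\p{Zs}\t\n\f\r]$/u.test(ch);

// Assumption: punctuation means general category P* or S*.
const isPunct = (ch) => ch !== undefined && /^[\p{P}\p{S}]$/u.test(ch);

// Left-/right-flanking test for a one-character delimiter run at index i.
function flanking(chars, i) {
  const before = chars[i - 1];
  const after = chars[i + 1];
  const left =
    !isWhitespace(after) &&
    (!isPunct(after) || isWhitespace(before) || isPunct(before));
  const right =
    !isWhitespace(before) &&
    (!isPunct(before) || isWhitespace(after) || isPunct(after));
  return { left, right };
}

// Toy check: can the first `*` open and the last `*` close emphasis?
function canFormEmphasis(input) {
  const chars = Array.from(input); // code points; lone surrogates stay single
  const first = chars.indexOf("*");
  const last = chars.lastIndexOf("*");
  if (first === -1 || first === last) return false;
  return flanking(chars, first).left && flanking(chars, last).right;
}

const raw = "a*\ud800*";
const wellFormed = raw.replace(/\p{Cs}/gu, "\uFFFD"); // step ii. done first

console.log(canFormEmphasis(raw));        // true:  emphasis is produced
console.log(canFormEmphasis(wellFormed)); // false: asterisks stay literal
```

With the raw input the first `*` is left-flanking because it is followed by a non-punctuation code unit. After replacement it is followed by punctuation and preceded by a letter, so it can no longer open emphasis.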
The following screenshot shows that an escape for an isolated surrogate code unit in HTML is replaced with U+FFFD when rendered by Firefox:

[screenshot]

It shows that a code point escape for an isolated surrogate code unit in HTML (and Markdown) is ultimately equivalent to U+FFFD (�).