Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777

Open
tats-u opened this issue Sep 23, 2024 · 5 comments

Comments

@tats-u
Copy link

tats-u commented Sep 23, 2024

commonmark/commonmark.js#261
commonmark/commonmark.js#289

https://spec.commonmark.org/0.31.2/#paragraphs

Leading spaces or tabs are skipped:
Example 222

  aaa
 bbb
<p>aaa
bbb</p>

Lines after the first may be indented any amount, since indented code blocks cannot interrupt paragraphs.
Example 223

aaa
             bbb
                                       ccc
<p>aaa
bbb
ccc</p>

However, the first line may be preceded by up to three spaces of indentation. Four spaces of indentation is too many:
Example 224

   aaa
bbb
<p>aaa
bbb</p>

Example 225

    aaa
bbb
<pre><code>aaa
</code></pre>
<p>bbb</p>

Final spaces or tabs are stripped before inline parsing, so a paragraph that ends with two or more spaces will not end with a hard line break:
Example 226

aaa     
bbb     
<p>aaa<br />
bbb</p>

The specification doesn't mention non-ASCII Unicode whitespaces, no-break space, form feed, or vertical tab, but some implementations treats them like ASCII space and tab, and remove trailing or leading ones in paragraphs.

cmark seems to remove at least U+00A0 NBSP, (but not U+3000 or U+2003) preserves U+00A0 too

echo "`u{a0}a`u{a0}" | ./cmark | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 3c 70 3e c2 a0 61 c2 a0 ┊ 3c 2f 70 3e 0a          │<p>××a××┊</p>_   │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

However cmark removes trailing Form Feeds and Vertical Tabs.

PS> echo "guard`n`n`vVT`v`v`n`fFF`f`f`nend"  | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ef bb bf 67 75 61 72 64 ┊ 0a 0a 0b 56 54 0b 0b 0a │×××guard┊__•VT••_│
│00000010│ 0c 46 46 0c 0c 0a 65 6e ┊ 64 0d 0a                │_FF___en┊d__     │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
PS> echo "guard`n`n`vVT`v`v`n`fFF`f`f`nend"  | ./cmark | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 3c 70 3e 67 75 61 72 64 ┊ 3c 2f 70 3e 0a 3c 70 3e │<p>guard┊</p>_<p>│
│00000010│ 0b 56 54 0a 0c 46 46 0a ┊ 65 6e 64 3c 2f 70 3e 0a │•VT__FF_┊end</p>_│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

commonmark.js removes all characters trimmed by trim() of JS.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar

micromark doesn't touch other than ASCII space (U+0020) or tab (U+0009).

Other implementations:

https://babelmark.github.io/?text=%0BVertical+Tab+(U%2B000B)%0B%0A%0A%0CLine+Separator+(U%2B000C)%0C%0A%0A+Space+(U%2B0020)+%0A%0A%09Tab+(U%2B0009)%09%0A%0A%C2%A0NBSP+(U%2B00A0)%C2%A0%0A%0A%E2%80%83EM+Space+(U%2B2003)%E2%80%83%0A%0A%E2%80%A8Line+Separator+(U%2B2028)%E2%80%A8%0A%0A%E2%80%A9Paragraph+Separator+(U%2B2029)%E2%80%A9%0A%0A%E3%80%80Chinese%2FJapanese+Space+(U%2B3000)%E3%80%80%0A

@tats-u tats-u changed the title Ensure that trailing/leading Unicode whitespaces in paragraphs are not removed Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Line Feed/No-break Space in paragraphs are not removed Sep 23, 2024
@tats-u tats-u changed the title Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Line Feed/No-break Space in paragraphs are not removed Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed Sep 23, 2024
@wooorm
Copy link
Contributor

wooorm commented Sep 23, 2024

IMO commonmarkjs / cmark are wrong here. Spec is super clear about this.

@tats-u
Copy link
Author

tats-u commented Sep 23, 2024

IMO commonmarkjs / cmark are wrong here.

I agree with you.

super clear

Does your opinion come from the definition of space and tab? To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions.
I think it would be clearer and stop making library authors worry about extra concerns (i.e. Unicode whitespace handling) if every "space" and "tab" are linked.

Leading [space]s or [tab]s are skipped:

[space]: #space
[tab]: #tab

I wish new additional test cases that Unicode white spaces should be preserved would be added. The current examples and test cases use only ASCII space.
Also, I think the spec should advise library authors to confirm the strict specification (what kind of characters is removed) of .trim() method in the language library and not to use them (JS & .NET remove Unicode ones and Java removes VT & FF) in this context. I think it is a common pitfall.

@wooorm
Copy link
Contributor

wooorm commented Sep 23, 2024

I wish new additional test cases that Unicode white spaces should be preserved would be added. The current examples and test cases use only ASCII space.

It was not clear to me what this issue was about. Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?

To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions.

I don’t see it. The spec is very clear about characters: https://spec.commonmark.org/0.31.2/#characters-and-lines.

Specs are always a bit technical. They must be interpreted strictly.

@tats-u
Copy link
Author

tats-u commented Sep 23, 2024

Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?

I want an example like the following to be available as an officially-provided example or test case:

https://github.com/commonmark/commonmark.js/pull/289/files#diff-76b8b4c10de6a516d009bda44d45cd15db413db5eb2bbbd7f3376d3ad8ee6b8c

The spec is very clear about characters

The syntax itself is so. It might be due to the word choices ("space" and "tab", both are very common words) that library authors have incorrectly implemented the non-Unicode whitespace handling. (or due to the lack of understanding of .trim() methods)

@jgm
Copy link
Member

jgm commented Sep 23, 2024

Agreed, there should be a test case with, say, NBSP at the beginning and end of a paragraph, showing that it is not stripped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants