-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define line break styles for east asian characters as options #411
Define line break styles for east asian characters as options #411
Conversation
This commit aims to produce more natural line breaks in the rendered output.
In this case, "naturally" varies from person to person. It is more preferable for me this is an option for users. |
Thank you for your feedback. I completely understand your viewpoint. The perception of what appears "natural" can indeed vary among different users.
With that in mind, I fully support making this feature an optional behavior for users. I can proceed with implementing this as an option if everyone agrees. Looking forward to further feedback and suggestions. |
It might be better to say that options for For example: extension.NewCJK(
extension.WithEastAsianLineBreaks(
extension.WithXXX(),
),
) |
I've followed your suggestion and added a |
Thanks for updating PR. Segmentation break problems have been discussed for a long, especially in CSS specification. 'Natural' segmentation break is really hard, depends on many languages. In the past, CSS specifications draft defined segmentation breaks as follows.
I think option names like CSS word break property values are more flexible than concrete option names like Like the following options, 'style' aggregates detailed options into one option value:
And
Note that for backward compatibilities, |
Thank you for your detailed feedback. I've closely aligned the line break specifications with your intentions, which is why we opted for the current implementation. We wanted a solution that was both intuitive for users and consistent with the desired functionality. Your reference to the ongoing discussions surrounding the CSS specifications has been invaluable. I have delved into the extensive debates and found them to be quite intricate. Honestly, fully comprehending the nuances of the CSS line break specifications has been a challenge. Given its complexity, achieving a perfect implementation of the CSS line break specifications would indeed be tough. However, in keeping with your intentions and after considering your feedback, I've aimed for an approach that enhances code readability and extensibility. This ensures that future modifications or additions can be seamlessly integrated. Your suggestion of using Regarding backward compatibility, I've ensured that I genuinely value your insights and constructive feedback. I'll continue refining the implementation to make it as adaptable and efficient as possible. Please let me know if there are other concerns or recommendations you'd like to share. |
EastAsianLineBreaksStyleSimple EastAsianLineBreaksStyle = iota | ||
// EastAsianLineBreaksCSS3Draft is a style where soft line breaks are ignored | ||
// even if only one side of the break is an east asian wide character. | ||
EastAsianLineBreaksCSS3Draft |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a better name than EastAsianLineBreaksCSS3Draft
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems your implementation does not satisfy CSS3 draft rules. We may have choices...
- Implement CSS3 Draft rules and leave this name as it is
- Change this name to represent your implementation
I prefer 1.
- Implement CSS text level 3 draft rules.
- Implement additional enhancement(i.e. resolves [css-text-3] Segment Break Transformation Rules around CJK Punctuation w3c/csswg-drafts#5086)
- Write README like 'This option implements CSS text level3 Segment Break Transformation Rules with some enhancements'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your thorough feedback.
I’m on board with option 1 and will give the CSS text level 3 rules and additional enhancements a shot. Admittedly, I'm not a pro with this CSS issue, so while I'll do my best, I might miss some nuances. Any extra guidance or pointers while I work through this would be awesome!
Will keep you posted on the progress. Talk to you soon!
sibling := node.NextSibling() | ||
if sibling != nil && sibling.Kind() == ast.KindText { | ||
if siblingText := sibling.(*ast.Text).Text(source); len(siblingText) != 0 { | ||
thisLastRune := util.ToRune(value, len(value)-1) | ||
siblingFirstRune, _ := utf8.DecodeRune(siblingText) | ||
if !(util.IsEastAsianWideRune(thisLastRune) && | ||
util.IsEastAsianWideRune(siblingFirstRune)) { | ||
if r.EastAsianLineBreaks.EastAsianLineBreaksFunction.SoftLineBreak(thisLastRune, siblingFirstRune) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am convinced that this is one of the things we should implement.
renderer/html/html.go
Outdated
EastAsianLineBreaksCSS3Draft | ||
) | ||
|
||
type eastAsianLineBreaksFunction interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name does not meet conventions in Go.
softLineBreaker
is better for this kind of interfaces.
For this kind of interfaces that has only one method, we often define a factory function (i.e. http.Handler
and http.HandlerFunc
).
renderer/html/html.go
Outdated
type eastAsianLineBreaksCSS3Draft struct{} | ||
|
||
func (e *eastAsianLineBreaksCSS3Draft) SoftLineBreak(thisLastRune rune, siblingFirstRune rune) bool { | ||
return !(util.IsEastAsianWideRune(thisLastRune) || util.IsEastAsianWideRune(siblingFirstRune)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this does not satisfy CSS3 Draft rules.
- If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.
- Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.
- Otherwise, the segment break is converted to a space (U+0020).
So CSS3Draft
is not suitable for this implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try implementing based on the provided direction, though I'm not fully confident I've grasped all the details (still wrapping my head around Unicode handling). Might need your patience and continued reviews – really appreciate it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing the CSS Text Module Level 3, it appears the current usage is as shown in the following screenshot. May I proceed to implement the specifications (+ enhancements) you pointed out as CSS3Draft
?
CSS3Draft
:
- If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.
- Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.
- Otherwise, if either the character before or after the segment break belongs to the space-discarding character set and is a Unicode Punctuation (P*) or U+3000, then the segment break is removed (cf. [css-text-3] Segment Break Transformation Rules around CJK Punctuation w3c/csswg-drafts#5086)
- Otherwise, the segment break is converted to a space (U+0020).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Some implementation of this algorithm may be helpful for you
This plugin finds and removes newlines that cannot be converted to space, algorithm matches CSS Text Module Level 3:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the information. I will begin implementing CSS3Draft
using the rules discussed in this thread!
EastAsianLineBreaksStyleSimple EastAsianLineBreaksStyle = iota | ||
// EastAsianLineBreaksCSS3Draft is a style where soft line breaks are ignored | ||
// even if only one side of the break is an east asian wide character. | ||
EastAsianLineBreaksCSS3Draft |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems your implementation does not satisfy CSS3 draft rules. We may have choices...
- Implement CSS3 Draft rules and leave this name as it is
- Change this name to represent your implementation
I prefer 1.
- Implement CSS text level 3 draft rules.
- Implement additional enhancement(i.e. resolves [css-text-3] Segment Break Transformation Rules around CJK Punctuation w3c/csswg-drafts#5086)
- Write README like 'This option implements CSS text level3 Segment Break Transformation Rules with some enhancements'
I've finally completed the implementation of P.S. |
I've merged the PR and made some improvements codes in 9c90033. Thanks for your contribution. |
…eak styles This commit follows yuin/goldmark#411
…eak styles This commit follows yuin/goldmark#411
Thank you for the merge. I also appreciate the extensive reviews and explanations of the direction. |
|
In this pull request, we've addressed the rendering of line breaks, especially concerning the interaction between Western and East Asian wide characters. The aim is to ensure that the output is more intuitive and naturally rendered.
Changes:
Refactoring in
renderText
Method:Updated Test Cases:
cjk_test.go
to reflect the desired behavior.Through these changes, we hope to enhance the readability of rendered content, especially when dealing with mixed character types. We kindly ask for your review and feedback on these modifications.
Example 1:
Input Markdown:
Current Output:
Expected Output:
Example 2:
Input Markdown:
I am a programmer. <!-- notice: there is a white space after the last period --> 私はプログラマーです。
Current Output:
Expected Output: