Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define line break styles for east asian characters as options #411

Merged
merged 8 commits into from
Oct 28, 2023
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 40 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -379,10 +379,48 @@ This extension provides additional options for CJK users.

| Functional option | Type | Description |
| ----------------- | ---- | ----------- |
| `extension.WithEastAsianLineBreaks` | `-` | Soft line breaks are rendered as a newline. Some asian users will see it as an unnecessary space. With this option, soft line breaks between east asian wide characters will be ignored. |
| `extension.WithEastAsianLineBreaks` | `...extension.EastAsianLineBreaksStyle` | Soft line breaks are rendered as a newline. Some asian users will see it as an unnecessary space. With this option, soft line breaks between east asian wide characters will be ignored. |
| `extension.WithEscapedSpace` | `-` | Without spaces around an emphasis started with east asian punctuations, it is not interpreted as an emphasis(as defined in CommonMark spec). With this option, you can avoid this inconvenient behavior by putting 'not rendered' spaces around an emphasis like `太郎は\ **「こんにちわ」**\ といった`. |


#### Styles of Line Breaking

| Style | Description |
| ----- | ----------- |
| `EastAsianLineBreaksStyleSimple` | Soft line breaks are ignored if both sides of the break are east asian wide character. This behavior is the same as [`east_asian_line_breaks`](https://pandoc.org/MANUAL.html#extension-east_asian_line_breaks) in Pandoc. |
| `EastAsianLineBreaksCSS3Draft` | Soft line breaks are ignored even if only one side of the break is east asian wide character. |

#### Example of `EastAsianLineBreaksStyleSimple`

Input Markdown:

```md
私はプログラマーです。
東京の会社に勤めています。
GoでWebアプリケーションを開発しています。
```

Output:

```md
<p>私はプログラマーです。東京の会社に勤めています。\nGoでWebアプリケーションを開発しています。</p>
```

#### Example of `EastAsianLineBreaksCSS3Draft`

Input Markdown:

```md
私はプログラマーです。
東京の会社に勤めています。
GoでWebアプリケーションを開発しています。
```

Output:

```md
<p>私はプログラマーです。東京の会社に勤めています。GoでWebアプリケーションを開発しています。</p>
```

Security
--------------------
By default, goldmark does not render raw HTML or potentially-dangerous URLs.
Expand Down
41 changes: 36 additions & 5 deletions extension/cjk.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,30 @@ import (
// A CJKOption sets options for CJK support mostly for HTML based renderers.
type CJKOption func(*cjk)

// A EastAsianLineBreaksStyle is a style of east asian line breaks.
type EastAsianLineBreaksStyle int

const (
// EastAsianLineBreaksStyleSimple is a style where soft line breaks are ignored
// if both sides of the break are east asian wide characters.
EastAsianLineBreaksStyleSimple EastAsianLineBreaksStyle = iota
// EastAsianLineBreaksCSS3Draft is a style where soft line breaks are ignored
// even if only one side of the break is an east asian wide character.
EastAsianLineBreaksCSS3Draft
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a better name than EastAsianLineBreaksCSS3Draft?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems your implementation does not satisfy CSS3 draft rules. We may have choices...

  1. Implement CSS3 Draft rules and leave this name as it is
  2. Change this name to represent your implementation

I prefer 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your thorough feedback.

I’m on board with option 1 and will give the CSS text level 3 rules and additional enhancements a shot. Admittedly, I'm not a pro with this CSS issue, so while I'll do my best, I might miss some nuances. Any extra guidance or pointers while I work through this would be awesome!

Will keep you posted on the progress. Talk to you soon!

)

// WithEastAsianLineBreaks is a functional option that indicates whether softline breaks
// between east asian wide characters should be ignored.
func WithEastAsianLineBreaks() CJKOption {
func WithEastAsianLineBreaks(style ...EastAsianLineBreaksStyle) CJKOption {
return func(c *cjk) {
c.EastAsianLineBreaks = true
e := &eastAsianLineBreaks{
Enabled: true,
EastAsianLineBreaksStyle: EastAsianLineBreaksStyleSimple,
}
for _, s := range style {
e.EastAsianLineBreaksStyle = s
}
c.EastAsianLineBreaks = e
}
}

Expand All @@ -25,10 +44,15 @@ func WithEscapedSpace() CJKOption {
}

type cjk struct {
EastAsianLineBreaks bool
EastAsianLineBreaks *eastAsianLineBreaks
EscapedSpace bool
}

type eastAsianLineBreaks struct {
Enabled bool
EastAsianLineBreaksStyle EastAsianLineBreaksStyle
}

// CJK is a goldmark extension that provides functionalities for CJK languages.
var CJK = NewCJK(WithEastAsianLineBreaks(), WithEscapedSpace())

Expand All @@ -42,8 +66,15 @@ func NewCJK(opts ...CJKOption) goldmark.Extender {
}

func (e *cjk) Extend(m goldmark.Markdown) {
if e.EastAsianLineBreaks {
m.Renderer().AddOptions(html.WithEastAsianLineBreaks())
if e.EastAsianLineBreaks != nil {
if e.EastAsianLineBreaks.Enabled {
style := html.EastAsianLineBreaksStyleSimple
switch e.EastAsianLineBreaks.EastAsianLineBreaksStyle {
case EastAsianLineBreaksCSS3Draft:
style = html.EastAsianLineBreaksCSS3Draft
}
m.Renderer().AddOptions(html.WithEastAsianLineBreaks(style))
}
}
if e.EscapedSpace {
m.Renderer().AddOptions(html.WithWriter(html.NewWriter(html.WithEscapedSpace())))
Expand Down
33 changes: 33 additions & 0 deletions extension/cjk_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -208,4 +208,37 @@ func TestEastAsianLineBreaks(t *testing.T) {
},
t,
)

// test with EastAsianLineBreaksCSS3Draft
markdown = goldmark.New(goldmark.WithRendererOptions(
html.WithXHTML(),
html.WithUnsafe(),
),
goldmark.WithExtensions(
NewCJK(WithEastAsianLineBreaks(EastAsianLineBreaksCSS3Draft)),
),
)
no = 9
testutil.DoTestCase(
markdown,
testutil.MarkdownTestCase{
No: no,
Description: "Soft line breaks between a western character and an east asian wide character are ignored",
Markdown: "太郎は\\ **「こんにちわ」**\\ と言ったa\nんです",
Expected: "<p>太郎は\\ <strong>「こんにちわ」</strong>\\ と言ったaんです</p>",
},
t,
)

no = 10
testutil.DoTestCase(
markdown,
testutil.MarkdownTestCase{
No: no,
Description: "Soft line breaks between an east asian wide character and a western character are ignored",
Markdown: "太郎は\\ **「こんにちわ」**\\ と言った\nbんです",
Expected: "<p>太郎は\\ <strong>「こんにちわ」</strong>\\ と言ったbんです</p>",
},
t,
)
}
75 changes: 65 additions & 10 deletions renderer/html/html.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ import (
type Config struct {
Writer Writer
HardWraps bool
EastAsianLineBreaks bool
EastAsianLineBreaks eastAsianLineBreaks
XHTML bool
Unsafe bool
}
Expand All @@ -26,7 +26,7 @@ func NewConfig() Config {
return Config{
Writer: DefaultWriter,
HardWraps: false,
EastAsianLineBreaks: false,
EastAsianLineBreaks: eastAsianLineBreaks{},
XHTML: false,
Unsafe: false,
}
Expand All @@ -38,7 +38,7 @@ func (c *Config) SetOption(name renderer.OptionName, value interface{}) {
case optHardWraps:
c.HardWraps = value.(bool)
case optEastAsianLineBreaks:
c.EastAsianLineBreaks = value.(bool)
c.EastAsianLineBreaks = value.(eastAsianLineBreaks)
case optXHTML:
c.XHTML = value.(bool)
case optUnsafe:
Expand Down Expand Up @@ -103,24 +103,80 @@ func WithHardWraps() interface {
// EastAsianLineBreaks is an option name used in WithEastAsianLineBreaks.
const optEastAsianLineBreaks renderer.OptionName = "EastAsianLineBreaks"

// A EastAsianLineBreaksStyle is a style of east asian line breaks.
type EastAsianLineBreaksStyle int

const (
// EastAsianLineBreaksStyleSimple is a style where soft line breaks are ignored
// if both sides of the break are east asian wide characters.
EastAsianLineBreaksStyleSimple EastAsianLineBreaksStyle = iota
// EastAsianLineBreaksCSS3Draft is a style where soft line breaks are ignored
// even if only one side of the break is an east asian wide character.
EastAsianLineBreaksCSS3Draft
)

type eastAsianLineBreaksFunction interface {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name does not meet conventions in Go.
softLineBreaker is better for this kind of interfaces.

For this kind of interfaces that has only one method, we often define a factory function (i.e. http.Handler and http.HandlerFunc).

SoftLineBreak(thisLastRune rune, siblingFirstRune rune) bool
}

type eastAsianLineBreaksSimple struct{}

func (e *eastAsianLineBreaksSimple) SoftLineBreak(thisLastRune rune, siblingFirstRune rune) bool {
return !(util.IsEastAsianWideRune(thisLastRune) && util.IsEastAsianWideRune(siblingFirstRune))
}

type eastAsianLineBreaksCSS3Draft struct{}

func (e *eastAsianLineBreaksCSS3Draft) SoftLineBreak(thisLastRune rune, siblingFirstRune rune) bool {
return !(util.IsEastAsianWideRune(thisLastRune) || util.IsEastAsianWideRune(siblingFirstRune))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this does not satisfy CSS3 Draft rules.

  • If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.
  • Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.
  • Otherwise, the segment break is converted to a space (U+0020).

So CSS3Draft is not suitable for this implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try implementing based on the provided direction, though I'm not fully confident I've grasped all the details (still wrapping my head around Unicode handling). Might need your patience and continued reviews – really appreciate it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing the CSS Text Module Level 3, it appears the current usage is as shown in the following screenshot. May I proceed to implement the specifications (+ enhancements) you pointed out as CSS3Draft?

CSS3Draft:

  1. If the character immediately before or immediately after the segment break is the zero-width space character (U+200B), then the break is removed, leaving behind the zero-width space.
  2. Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.
  3. Otherwise, if either the character before or after the segment break belongs to the space-discarding character set and is a Unicode Punctuation (P*) or U+3000, then the segment break is removed (cf. [css-text-3] Segment Break Transformation Rules around CJK Punctuation w3c/csswg-drafts#5086)
  4. Otherwise, the segment break is converted to a space (U+0020).
image

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Some implementation of this algorithm may be helpful for you

This plugin finds and removes newlines that cannot be converted to space, algorithm matches CSS Text Module Level 3:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the information. I will begin implementing CSS3Draft using the rules discussed in this thread!

}

type eastAsianLineBreaks struct {
Enabled bool
EastAsianLineBreaksFunction eastAsianLineBreaksFunction
}

type withEastAsianLineBreaks struct {
eastAsianLineBreaksStyle EastAsianLineBreaksStyle
}

func (o *withEastAsianLineBreaks) SetConfig(c *renderer.Config) {
c.Options[optEastAsianLineBreaks] = true
switch o.eastAsianLineBreaksStyle {
case EastAsianLineBreaksStyleSimple:
c.Options[optEastAsianLineBreaks] = eastAsianLineBreaks{
Enabled: true,
EastAsianLineBreaksFunction: &eastAsianLineBreaksSimple{},
}
case EastAsianLineBreaksCSS3Draft:
c.Options[optEastAsianLineBreaks] = eastAsianLineBreaks{
Enabled: true,
EastAsianLineBreaksFunction: &eastAsianLineBreaksCSS3Draft{},
}
}
}

func (o *withEastAsianLineBreaks) SetHTMLOption(c *Config) {
c.EastAsianLineBreaks = true
switch o.eastAsianLineBreaksStyle {
case EastAsianLineBreaksStyleSimple:
c.EastAsianLineBreaks = eastAsianLineBreaks{
Enabled: true,
EastAsianLineBreaksFunction: &eastAsianLineBreaksSimple{},
}
case EastAsianLineBreaksCSS3Draft:
c.EastAsianLineBreaks = eastAsianLineBreaks{
Enabled: true,
EastAsianLineBreaksFunction: &eastAsianLineBreaksCSS3Draft{},
}
}
}

// WithEastAsianLineBreaks is a functional option that indicates whether softline breaks
// between east asian wide characters should be ignored.
func WithEastAsianLineBreaks() interface {
func WithEastAsianLineBreaks(style EastAsianLineBreaksStyle) interface {
renderer.Option
Option
} {
return &withEastAsianLineBreaks{}
return &withEastAsianLineBreaks{style}
}

// XHTML is an option name used in WithXHTML.
Expand Down Expand Up @@ -663,14 +719,13 @@ func (r *Renderer) renderText(w util.BufWriter, source []byte, node ast.Node, en
_, _ = w.WriteString("<br>\n")
}
} else if n.SoftLineBreak() {
if r.EastAsianLineBreaks && len(value) != 0 {
if r.EastAsianLineBreaks.Enabled && len(value) != 0 {
sibling := node.NextSibling()
if sibling != nil && sibling.Kind() == ast.KindText {
if siblingText := sibling.(*ast.Text).Text(source); len(siblingText) != 0 {
thisLastRune := util.ToRune(value, len(value)-1)
siblingFirstRune, _ := utf8.DecodeRune(siblingText)
if !(util.IsEastAsianWideRune(thisLastRune) &&
util.IsEastAsianWideRune(siblingFirstRune)) {
if r.EastAsianLineBreaks.EastAsianLineBreaksFunction.SoftLineBreak(thisLastRune, siblingFirstRune) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am convinced that this is one of the things we should implement.

_ = w.WriteByte('\n')
}
}
Expand Down