Add new rule to keep whitespace between ascii tokens
This resolves most of the odd spacing around ASCII input. Note that the
rule has to run at the end of the processing pipeline so that it works
with the rules that make sure ASCII-like punctuation next to Japanese
text gets the right results.
polm committed Dec 20, 2024
1 parent 519e374 commit c368f5c
Showing 1 changed file with 7 additions and 0 deletions.
7 changes: 7 additions & 0 deletions cutlet/cutlet.py
@@ -248,6 +248,13 @@ def romaji_tokens(self, words, capitalize=True, title=False):
                out.append(tok)
                continue

            # preserve spaces between ascii tokens
            if (word.surface.isascii() and
                    nw and nw.surface.isascii()):
                use_space = bool(nw.white_space)
                out.append(Token(word.surface, use_space))
                continue

            out.append(tok)

            # no space sometimes
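
The rule added in this hunk can be tried in isolation with the rough sketch below. It is not cutlet's actual code: `Word`, `Token`, `keep_ascii_spacing`, and `render` are hypothetical stand-ins. It assumes `white_space` holds the whitespace that preceded a word in the input (which is what the check on `nw` suggests), and that non-ASCII tokens get no trailing space, which the real pipeline decides with other rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical stand-in for a tokenizer node."""
    surface: str
    white_space: str = ""  # assumed: whitespace that preceded this word


@dataclass
class Token:
    """Hypothetical stand-in for cutlet's Token: text plus a trailing-space flag."""
    surface: str
    space: bool = False


def keep_ascii_spacing(words: List[Word]) -> List[Token]:
    out = []
    for i, word in enumerate(words):
        # nw is the next word, or None at the end of the input
        nw: Optional[Word] = words[i + 1] if i + 1 < len(words) else None
        if word.surface.isascii() and nw and nw.surface.isascii():
            # Both neighbours are ASCII: follow the spacing of the input.
            out.append(Token(word.surface, bool(nw.white_space)))
            continue
        # Assumption for this sketch only: everything else gets no space.
        out.append(Token(word.surface, False))
    return out


def render(tokens: List[Token]) -> str:
    return "".join(t.surface + (" " if t.space else "") for t in tokens)


# Input "foo bar(baz)" split into words, with preceding whitespace recorded.
ascii_words = [Word("foo"), Word("bar", " "), Word("("), Word("baz"), Word(")")]
print(render(keep_ascii_spacing(ascii_words)))
```

Because the flag is taken from the next word's preceding whitespace, a space is emitted only where the input had one, so punctuation that was attached in the input stays attached.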