Unwanted modifications to non-Japanese scripts #65
Comments
Sorry to hear you are having difficulty with Cutlet.
Can you give an example?
You need to set ensure_ascii=False.
That said, Cutlet is not designed to be run on text that is not mostly Japanese. You should probably run your input through a language detector and only run Cutlet on results marked as Japanese.
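As a rough illustration of the "only run Cutlet on Japanese input" advice, here is a minimal sketch. It is not a real language detector: it just measures the share of kana and kanji characters, so it will also fire on Chinese text. The names `looks_japanese` and `romanize_if_japanese` and the 0.5 threshold are made up for this example; a proper language detection library would likely be more robust.

```python
import re
from typing import Callable

# Hiragana, katakana, and CJK unified ideographs.
# Assumption: this character-range check stands in for a real language detector.
_JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def looks_japanese(text: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the non-space characters are kana or kanji."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    hits = sum(1 for c in chars if _JAPANESE_CHARS.match(c))
    return hits / len(chars) >= threshold


def romanize_if_japanese(text: str, romanize: Callable[[str], str]) -> str:
    """Apply `romanize` only to text that looks mostly Japanese; return other text unchanged."""
    return romanize(text) if looks_japanese(text) else text


# Hypothetical usage with Cutlet (requires cutlet to be installed):
#   import cutlet
#   katsu = cutlet.Cutlet(ensure_ascii=False)
#   romanize_if_japanese(line, katsu.romaji)
```

Passing the romanizer in as a callable keeps the filter independent of Cutlet itself, so the same check could wrap any other romanization function.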
It will turn "meaning" into "mean ing", and "[04:30.748]" into something
like "[04 :30 .748 ]"
On Mon, Dec 16, 2024, 3:43 AM polm ***@***.***> wrote:
Sorry to hear you are having difficulty with Cutlet.
Unfortunately Cutlet adds weird spacing to English text and punctuation,
Can you give an example?
completely freaks out when presented with Cyrillic characters.
You need to set ensure_ascii=False.
>>> import cutlet
>>> katsu = cutlet.Cutlet(ensure_ascii=False)
>>> katsu.romaji("ГЕЛИОС")
'ГЕЛИОС'
That said, Cutlet is not designed to be run on text that is not mostly
Japanese. You should probably run your input through a language detector
and only run Cutlet on results marked as Japanese.
I see what you mean. There is a way to improve this that I can work on, basically preserving whitespace between two tokens depending on their character class. However, if I understand lyric file formats correctly, a timestamp like "[04:30.748]" should be a prefix on each line. You should remove it and simply not run it through Cutlet.
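For the timestamp suggestion, something along these lines might work. This is a sketch under assumptions: the regex only covers bracketed prefixes shaped like "[04:30.748]" (the one example given in this thread), and `romanize_lyric_line` is a hypothetical helper, not part of Cutlet.

```python
import re
from typing import Callable

# One or more leading tags such as "[04:30.748]" or "[04:30]".
# Assumption: timestamps always appear as a prefix and follow this shape.
_TIMESTAMP_PREFIX = re.compile(r"^(?:\[\d+:\d{2}(?:\.\d+)?\]\s*)+")


def romanize_lyric_line(line: str, romanize: Callable[[str], str]) -> str:
    """Split off the timestamp prefix, romanize only the lyric text, then rejoin."""
    match = _TIMESTAMP_PREFIX.match(line)
    prefix = match.group(0) if match else ""
    body = line[len(prefix):]
    # Leave empty or whitespace-only bodies alone so nothing spurious is added.
    return prefix + (romanize(body) if body.strip() else body)


# Hypothetical usage with Cutlet (requires cutlet to be installed):
#   import cutlet
#   katsu = cutlet.Cutlet(ensure_ascii=False)
#   romanize_lyric_line("[04:30.748]...", katsu.romaji)
```

The prefix is copied through byte for byte, so the timestamp never reaches the romanizer and cannot pick up extra spaces.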
I can't reproduce this, please give a complete example string.
#66 should address the parts of this that look reproducible.
I'm running Cutlet in bulk on a lot of text which may or may not be in Japanese (transliterating all the lyrics in my music folder). Unfortunately Cutlet adds weird spacing to English text and punctuation, and completely freaks out when presented with Cyrillic characters. I have had to write all kinds of workarounds for this behavior and I'm still not catching all the issues. I would really like it if Cutlet did all this detection for me so it only changed the Japanese characters and left everything else alone.