Unwanted modifications to non-Japanese scripts #65

rexendevar · 2024-12-16T00:52:10Z

I'm running Cutlet in bulk on a lot of text which may or may not be in Japanese (transliterating all the lyrics in my music folder). Unfortunately Cutlet adds weird spacing to English text and punctuation, and completely freaks out when presented with Cyrillic characters. I have had to write all kinds of workarounds for this behavior and I'm still not catching all the issues. I would really like it if Cutlet did all this detection for me, so that it only changed the Japanese characters and left everything else alone.

polm · 2024-12-16T11:43:21Z

Sorry to hear you are having difficulty with Cutlet.

> Unfortunately Cutlet adds weird spacing to English text and punctuation,

Can you give an example?

> completely freaks out when presented with Cyrillic characters.

You need to set ensure_ascii=False.

>>> import cutlet
>>> katsu = cutlet.Cutlet(ensure_ascii=False)
>>> katsu.romaji("ГЕЛИОС")
'ГЕЛИОС'

That said, Cutlet is not designed to be run on text that is not mostly Japanese. You should probably run your input through a language detector and only run Cutlet on results marked as Japanese.
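A minimal sketch of the filtering suggested above, using a character-range check as a rough stand-in for a real language detector. The helper name `romanize_if_japanese` and the `romanize` parameter are hypothetical and not part of Cutlet. Note that kanji are shared with Chinese, so this heuristic will also match Chinese text.

```python
import re
from typing import Callable

# Hiragana, Katakana, and the main CJK Unified Ideographs block.
# Assumption: "contains any of these characters" is good enough to
# decide that a line should be romanized. It is not real language
# detection, and kanji-only lines may in fact be Chinese.
JAPANESE_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def romanize_if_japanese(line: str, romanize: Callable[[str], str]) -> str:
    """Apply `romanize` only to lines containing Japanese script.

    Lines without kana or kanji (English, Cyrillic, timestamps only, ...)
    are returned unchanged.
    """
    if JAPANESE_CHARS.search(line):
        return romanize(line)
    return line
```

With Cutlet installed, `romanize` could be the `romaji` method of a `cutlet.Cutlet(ensure_ascii=False)` instance, as in the session above. Lines that mix Japanese with other scripts would still go through Cutlet as a whole.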

rexendevar · 2024-12-16T11:46:31Z

It will turn "meaning" into "mean ing", and "[04:30.748]" into something like "[04 :30 .748 ]"


polm · 2024-12-18T04:25:44Z

> "[04:30.748]" into something like "[04 :30 .748 ]"

I see what you mean. There is a way to improve this that I can work on, basically preserving whitespace between two tokens depending on their character class.

However, if I understand lyric file formats correctly, this timestamp should be a prefix on each line. You should strip it off and simply not run it through Cutlet.
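A minimal sketch of that workaround, assuming LRC-style time tags of the form `[mm:ss.xxx]` at the start of each line. The helper name `romanize_lyric_line` and the `romanize` parameter are hypothetical and not part of Cutlet.

```python
import re
from typing import Callable

# One or more leading time tags such as "[04:30.748]".
# Assumption: tags only need to be handled at the start of a line.
TIME_TAGS = re.compile(r"^(?:\[\d+:\d{2}(?:[.:]\d+)?\])+")


def romanize_lyric_line(line: str, romanize: Callable[[str], str]) -> str:
    """Romanize the lyric text of a line while leaving time tags untouched."""
    match = TIME_TAGS.match(line)
    prefix = match.group(0) if match else ""
    text = line[len(prefix):]
    if not text.strip():
        # Nothing but time tags (or an empty line): return it as-is.
        return line
    return prefix + romanize(text)
```

Here `romanize` could be the `romaji` method of a `cutlet.Cutlet` instance. The tag is reattached verbatim, so it never passes through the tokenizer.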

> It will turn "meaning" into "mean ing",

I can't reproduce this. Please give a complete example string.

polm · 2024-12-20T12:22:59Z

#66 should address the parts of this that look reproducible.

polm added the "question" (Further information is requested) label Dec 16, 2024

polm mentioned this issue Dec 20, 2024

Better handling of spaces in ASCII text (fixes #65) #66

Open
