Unwanted modifications to non-Japanese scripts #65

Open
rexendevar opened this issue Dec 16, 2024 · 4 comments
Labels
question Further information is requested

Comments

@rexendevar

I'm running Cutlet in bulk on a lot of text that may or may not be in Japanese (I'm transliterating all the lyrics in my music folder). Unfortunately Cutlet adds weird spacing to English text and punctuation, and completely freaks out when presented with Cyrillic characters. I have had to write all kinds of workarounds for this behavior, and I'm still not catching all the issues. I would really like it if Cutlet did all this detection for me, so that it only changed the Japanese characters and left everything else alone.

@polm
Owner

polm commented Dec 16, 2024

Sorry to hear you are having difficulty with Cutlet.

> Unfortunately Cutlet adds weird spacing to English text and punctuation,

Can you give an example?

> completely freaks out when presented with Cyrillic characters.

You need to set ensure_ascii=False.

>>> import cutlet
>>> katsu = cutlet.Cutlet(ensure_ascii=False)
>>> katsu.romaji("ГЕЛИОС")
'ГЕЛИОС'

That said, Cutlet is not designed to be run on text that is not mostly Japanese. You should probably run your input through a language detector and only run Cutlet on results marked as Japanese.
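As a rough illustration of that filtering step, here is a minimal sketch that uses a Unicode-range check as a stand-in for a real language detector. The helper names `contains_japanese` and `romanize_if_japanese` are hypothetical and not part of Cutlet; `romanize` is any callable, for example `katsu.romaji`. Note the heuristic is an assumption: kanji-only Chinese text would also match.

```python
import re

# Hiragana, katakana, and CJK unified ideographs.
# Assumption: a character-range heuristic is good enough here; a proper
# language detector would be more accurate on mixed or Chinese text.
JAPANESE_CHARS = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")


def contains_japanese(text):
    """Return True if the text has at least one kana or kanji character."""
    return JAPANESE_CHARS.search(text) is not None


def romanize_if_japanese(text, romanize):
    """Apply `romanize` only when the text looks Japanese; otherwise
    return the input unchanged."""
    return romanize(text) if contains_japanese(text) else text
```

With Cutlet installed, this could be called as `romanize_if_japanese(line, katsu.romaji)`, so English or Cyrillic lines never reach Cutlet at all.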

polm added the question label (Further information is requested) on Dec 16, 2024
@rexendevar
Author

rexendevar commented Dec 16, 2024 via email

@polm
Owner

polm commented Dec 18, 2024

> "[04:30.748]" into something like "[04 :30 .748 ]"

I see what you mean. There is a way to improve this that I can work on, basically preserving whitespace between two tokens depending on their character class.

However, if I understand lyric file formats correctly, this should be a prefix on each line. You should remove it and simply not run it through cutlet.
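A minimal sketch of stripping that prefix before romanizing, assuming LRC-style tags of the form `[mm:ss.xxx]` at the start of each line. The names `split_timestamp` and `romanize_lyric_line` are hypothetical helpers, not Cutlet functions, and `romanize` stands for any callable such as `katsu.romaji`.

```python
import re

# One or more leading tags such as "[04:30.748]".
# Assumption: timestamps follow the common LRC layout; other tag formats
# (metadata tags like "[ar:...]") are not handled here.
TIMESTAMP_PREFIX = re.compile(r"^(?:\[\d{1,3}:\d{2}(?:[.:]\d{1,3})?\])+")


def split_timestamp(line):
    """Split a lyric line into (timestamp prefix, remaining text)."""
    match = TIMESTAMP_PREFIX.match(line)
    if match is None:
        return "", line
    return match.group(0), line[match.end():]


def romanize_lyric_line(line, romanize):
    """Romanize only the lyric text and reattach the untouched prefix."""
    prefix, text = split_timestamp(line)
    return prefix + romanize(text)
```

Because the prefix never passes through the romanizer, the timestamp comes back byte-for-byte identical regardless of how the tokenizer spaces punctuation.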

> It will turn "meaning" into "mean ing",

I can't reproduce this. Please give a complete example string.

@polm
Owner

polm commented Dec 20, 2024

#66 should address the parts of this that look reproducible.
