Romanizing each single character after tagging and tokenization #64

yaserahmady · 2024-11-20T18:13:27Z


Hello there, great job on cutlet, thank you!

I've made a small tool for learning Japanese and taking notes during class. While learning kana, I like using it to quickly romanize Japanese text and convert it to ruby text with a little markup that I can paste into the Obsidian app (the plugin I use to parse this markup is https://github.com/steven-kraft/obsidian-markdown-furigana).

It mostly works, but this is an example of the current output, which I'd love to improve:

にほんnihonごgo　でde　あいさつaisatsu　をwo　しshiましょうmashou

$ romaji.py にほんごであいさつをしましょう

Original: にほんご　で　あいさつ　を　しましょう　
Ruby:     {にほん|nihon}{ご|go}　{で|de}　{あいさつ|aisatsu}　{を|wo}　{し|shi}{ましょう|mashou}　
Romaji:   nihongo de aisatsu wo shimashou

I'd like to have the Ruby text for each single character, like this expected output:

にniほhoんnごgo　でde　あaいiさsaつtsu　をwo　しshiまmaしょshoうu

$ romaji.py にほんごであいさつをしましょう

Original: にほんご　で　あいさつ　を　しましょう　
Ruby:     {に|ni}{ほ|ho}{ん|n}{ご|go}　{で|de}　{あ|a}{い|i}{さ|sa}{つ|tsu}　{を|wo}　{し|shi}{ま|ma}{しょ|sho}{う|u}
Romaji:   nihongo de aisatsu wo shimashou

The issue is that cutlet appears to derive the correct romanization from a set of rules plus UniDic, so I cannot simply loop through each character and call katsu.romaji(single_char): the romanized result would often be incorrect, and this approach would also treat しょ as two separate characters.

This is my whole script:

import sys
from string import ascii_letters, punctuation

import cutlet

katsu = cutlet.Cutlet()


def main():
    input_text = " ".join(sys.argv[1:])

    if not input_text:
        print("No input text")
        sys.exit(1)

    jt = JapaneseText(input_text)

    print(f"Original: {jt.original_text}")
    print(f"Ruby:     {jt.to_obsidian_ruby}")
    print(f"Romaji:   {jt.romaji_text}")


class JapaneseText:
    SPACE = "　"

    def __init__(self, input_text):
        # Remove all spaces
        input_text = input_text.replace(" ", "")

        # Remove all ASCII letters
        for letter in ascii_letters:
            input_text = input_text.replace(letter, "")

        self.input_text = input_text

        self.romaji_text = katsu.romaji(self.input_text).lower()
        self.normalized_text = cutlet.normalize_text(self.input_text)
        self.words = katsu.tagger(self.normalized_text)
        self.tokens = katsu.romaji_tokens(self.words)

    @property
    def original_text(self):
        result = ""
        for token, word in zip(self.tokens, self.words):
            space = self.SPACE if token.space else ""
            result += str(word) + space

        return result

    @property
    def to_obsidian_ruby(self):
        result = ""
        for token, word in zip(self.tokens, self.words):
            space = self.SPACE if token.space else ""
            if not self.is_punctuation(word):
                result += f"{{{word.surface}|{token.surface.lower()}}}{space}"
            else:
                result += str(word)

        return result

    def is_punctuation(self, word):
        return (
            str(word) in ["。", "、", "！", "？", "：", "；", "「", "」"]
            or str(word) in punctuation
        )


if __name__ == "__main__":
    main()

This is what I tried to get single character romanization:

def to_obsidian_ruby_single_character(text):
    kanas = list(text)
    result = ""
    for kana in kanas:
        romaji = katsu.romaji(kana)
        result += "{" + kana + "|" + romaji.lower() + "}"
    return f"{result}"

polm (Maintainer) · 2024-12-01T03:15:22Z

If I understand correctly, the issue is that you want to treat sequences like しょ as a single unit for furigana. In that case I recommend using the mapping in cutlet (or your own mapping) and repeatedly taking the longest match, knowing that you only need to consider sequences of at most two characters. It would look a bit like this:

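# mapping: a dict from kana (one or two characters) to romaji
text = "しょうがない"
out = ""
ii = 0
while ii < len(text):
    if text[ii:ii+2] in mapping:
        # longest match first: two-character sequences such as しょ
        out += mapping[text[ii:ii+2]]
        ii += 2  # skip both characters of the matched pair
    elif text[ii] in mapping:
        out += mapping[text[ii]]
        ii += 1
    else:
        # not in mapping so output as-is
        out += text[ii]
        ii += 1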

Since you are working to learn Japanese, however, I would strongly recommend you don't do this and focus on learning kana first without relying on romaji.
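
For readers who still want per-kana ruby markup, the longest-match idea above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `MAPPING`, `split_kana_units`, and `to_obsidian_ruby_per_kana` are hypothetical names introduced for illustration, and the table is a tiny hand-written one that covers just the sample sentence from the question. It is not cutlet's own mapping.

```python
# Hypothetical, deliberately tiny kana-to-romaji table for illustration only.
# It is NOT cutlet's internal mapping and covers just the sample sentence.
MAPPING = {
    "に": "ni", "ほ": "ho", "ん": "n", "ご": "go", "で": "de",
    "あ": "a", "い": "i", "さ": "sa", "つ": "tsu", "を": "wo",
    "し": "shi", "ま": "ma", "しょ": "sho", "う": "u",
}


def split_kana_units(text, mapping):
    """Split text into (kana, romaji) pairs, preferring two-character matches."""
    units = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in mapping:
            # Longest match: a digraph such as しょ is kept as one unit.
            units.append((pair, mapping[pair]))
            i += 2
        elif text[i] in mapping:
            units.append((text[i], mapping[text[i]]))
            i += 1
        else:
            # Unknown character (kanji, punctuation, ...): keep it, no reading.
            units.append((text[i], None))
            i += 1
    return units


def to_obsidian_ruby_per_kana(text, mapping=MAPPING):
    """Render each unit as {kana|romaji}; unmapped characters pass through."""
    parts = []
    for kana, romaji in split_kana_units(text, mapping):
        parts.append(kana if romaji is None else f"{{{kana}|{romaji}}}")
    return "".join(parts)


print(to_obsidian_ruby_per_kana("しましょう"))
# prints {し|shi}{ま|ma}{しょ|sho}{う|u}
```

Note that a plain lookup table has no notion of context, so it cannot reproduce context-dependent readings (for example, the particle は being read "wa"), and the spacing between words would still have to come from the tokenizer output.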
