This repository has been archived by the owner on Sep 29, 2022. It is now read-only.

surface_forms does not properly pick out the reading for '小学校' #6

Open
pganssle opened this issue Apr 30, 2016 · 0 comments

pganssle commented Apr 30, 2016

I've written a little script that tries to infer the reading for each kanji as it is used in a given compound. The way I'm doing it is by pulling all possible combinations of readings from KanjiDic, then running surface_forms over all possible segments to see if one of them matches the known reading.

However, this occasionally fails because of onbin variation on the second segment of a word, e.g. 小学校. The reading for this is 小[しょう]学[がっ]校[こう]; the reading for 学 is an onbin variation on its on'yomi, ガク, with the final ク replaced by a small っ. (Note: I'm using the terminology from the library. I'm not familiar enough with Japanese phonology to know what's actually happening on a linguistic level.)

Is this just irregular phonetic variation, or should surface_forms be modified so all but the last segment can undergo onbin variation?

Here is an MWE:

from cjktools.resources import kanjidic
from cjktools.alternations import surface_forms

from itertools import chain, product, starmap

kdict = kanjidic.Kanjidic()

def get_kanji_reading(compound, reading):
    # Every combination of one KanjiDic reading per kanji in the compound
    kana_segments = product(*(kdict[kanji].all_readings for kanji in compound))

    # Expand each combination with surface_forms and look for an exact match
    for cand in chain.from_iterable(map(surface_forms, kana_segments)):
        if ''.join(cand) == reading:
            return tuple(zip(compound, cand))

    # Fallback: no segmentation found, so keep the compound whole
    return ((compound, reading),)

pairs = [(u'小学校', u'しょうがっこう'), (u'醸造', u'じょうぞう')]

for compound, reading in pairs:
    print(''.join(starmap('{}[{}]'.format,
                          get_kanji_reading(compound, reading))))

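This prints the following. 小学校 falls through to the unsegmented fallback, while 醸造 is split per kanji as expected: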

小学校[しょうがっこう]
醸[じょう]造[ぞう]
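
For illustration, here is a rough sketch of what "all but the last segment can undergo onbin variation" might look like. It does not use cjktools and is not how surface_forms is implemented. The names `sokuon_variants`, `surface_forms_with_onbin` and `match_reading` are hypothetical, the segments are assumed to already be in hiragana, and the list of kana that can turn into っ is a simplifying assumption (real gemination also depends on the following consonant).

```python
# -*- coding: utf-8 -*-
from itertools import product

# Assumption: segment-final kana that may be replaced by a small っ.
# This is a simplified list, not taken from cjktools.
GEMINATING_FINALS = u'つちくき'
SOKUON = u'っ'


def sokuon_variants(segment):
    """Return the segment plus, where applicable, its っ-final variant."""
    variants = [segment]
    if len(segment) > 1 and segment[-1] in GEMINATING_FINALS:
        variants.append(segment[:-1] + SOKUON)
    return variants


def surface_forms_with_onbin(segments):
    """Hypothetical stand-in: every segment except the last may geminate."""
    last = len(segments) - 1
    candidate_sets = [
        sokuon_variants(seg) if i < last else [seg]
        for i, seg in enumerate(segments)
    ]
    # One tuple per combination of per-segment variants
    return list(product(*candidate_sets))


def match_reading(segments, reading):
    """Return the first candidate whose concatenation equals the reading."""
    for cand in surface_forms_with_onbin(segments):
        if u''.join(cand) == reading:
            return cand
    return None


result = match_reading((u'しょう', u'がく', u'こう'), u'しょうがっこう')
print(u' '.join(result))  # しょう がっ こう
```

Under these assumptions 学 matches as がっ because it is not the final segment, which is the behavior the question above is asking about. Whether that rule is linguistically sound in general is still the open question.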