surface_forms does not properly pick out the reading for '小学校' #6

pganssle · 2016-04-30T20:02:57Z

I've written a little script that tries to infer the reading for each kanji as it is used in a given compound. The way I'm doing it is by pulling all possible combinations of readings from KanjiDic, then running surface_forms over all possible segments to see if one of them matches the known reading.

However, this is occasionally failing because of onbin variation that occurs for the second segment of a word, e.g. 小学校. The reading for this is 小[しょう]学[がっ]校[こう] - the reading for 学 is an onbin variation on its onyomi: ガク. (Note: I'm using the terminology from the library - I'm not familiar enough with Japanese phonology to know what's actually happening on a linguistic level.)

Is this just irregular phonetic variation, or should surface_forms be modified so all but the last segment can undergo onbin variation?

Here is an MWE:

from cjktools.resources import kanjidic
from cjktools.alternations import surface_forms

from itertools import chain, product, starmap

kdict = kanjidic.Kanjidic()

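def get_kanji_reading(compound, reading):
    """Pair each kanji in `compound` with its share of `reading`, if possible."""
    # Every combination of one dictionary reading per kanji.
    kana_segments = product(*(kdict[kanji].all_readings for kanji in compound))

    # Expand each combination into its surface forms and look for a match.
    for cand in chain.from_iterable(map(surface_forms, kana_segments)):
        if ''.join(cand) == reading:
            return tuple(zip(compound, cand))

    # No segmentation matched: fall back to the whole compound and reading.
    return ((compound, reading),)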

pairs = [(u'小学校', u'しょうがっこう'), (u'醸造', u'じょうぞう')]

for compound, reading in pairs:
    print(''.join(starmap('{}[{}]'.format,
                          get_kanji_reading(compound, reading))))

This prints the following; the first compound falls back to the unsegmented form rather than 小[しょう]学[がっ]校[こう]:

小学校[しょうがっこう]
醸[じょう]造[ぞう]
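
For illustration, here is a rough sketch of the kind of change asked about above: letting every segment except the last end in a small っ. It is standalone and does not call cjktools. The names `GEMINATING_ENDINGS`, `sokuon_variants`, `with_nonfinal_sokuon` and `match_reading` are hypothetical, the list of endings is an assumption rather than a complete phonological rule, and the readings are assumed to be already in hiragana.

```python
from itertools import product

# Assumption: final kana that may become a small っ before another segment.
# Illustrative only; it ignores which consonant follows, so it over-generates.
GEMINATING_ENDINGS = u'くきつち'


def sokuon_variants(reading):
    """Return the reading plus, where applicable, its っ-final form."""
    variants = [reading]
    if len(reading) > 1 and reading[-1] in GEMINATING_ENDINGS:
        variants.append(reading[:-1] + u'っ')
    return variants


def with_nonfinal_sokuon(segments):
    """Yield segmentations in which any non-final segment may end in っ."""
    segments = list(segments)
    if not segments:
        return iter([()])
    options = [sokuon_variants(seg) for seg in segments[:-1]]
    options.append([segments[-1]])  # the last segment is left unchanged
    return product(*options)


def match_reading(segments, reading):
    """Return the first candidate segmentation that spells `reading`."""
    for cand in with_nonfinal_sokuon(segments):
        if u''.join(cand) == reading:
            return cand
    return None


print(match_reading((u'しょう', u'がく', u'こう'), u'しょうがっこう'))
```

Because candidates are only accepted when they join to the known reading, over-generating variants here should cost some extra comparisons without producing wrong matches for a single combination of readings.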
