left-id and right-id values in extras/reiwa.33.csv for UniDic 2.1.2 not 2.3.0? #8

tyknkd · 2021-02-22T10:45:36Z

Maybe I'm missing something, or maybe it's not important, but I just noticed that the left and right context ID values in extras/reiwa.33.csv seem to be for UniDic 2.1.2.

If I'm not mistaken the corresponding values from left-id.def and right-id.def for UniDic CJW 2.3.0 should be:

left-id.def: 14629 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*
right-id.def: 15402 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*

left-id.def: 18255 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*
right-id.def: 20453 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*
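For anyone who wants to repeat this check, here is a minimal sketch of the lookup. It assumes each line of left-id.def and right-id.def has the form `<id> <feature string>` separated by a single space, and it embeds only the four lines quoted above rather than reading the real files; `load_id_def` is a hypothetical helper name.

```python
# Excerpts quoted in this comment, in the assumed "<id> <feature string>" layout.
LEFT_ID_DEF = '''14629 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*
18255 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*
'''
RIGHT_ID_DEF = '''15402 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*
20453 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*
'''


def load_id_def(lines):
    """Map feature string -> context ID (hypothetical helper)."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only on the first space: the feature string itself contains commas.
        id_str, features = line.split(" ", 1)
        table[features] = int(id_str)
    return table


left = load_id_def(LEFT_ID_DEF.splitlines())
right = load_id_def(RIGHT_ID_DEF.splitlines())

PROPER_NOUN = '名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*'
SYMBOL = "補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*"

print(left[PROPER_NOUN], right[PROPER_NOUN])
print(left[SYMBOL], right[SYMBOL])
```

To run this against a real dictionary, replace the embedded strings with the contents of the two .def files from the dictionary source directory.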


polm · 2021-02-24T12:37:32Z

Huh, thanks for pointing that out. I initially made the file for unidic-lite and must have just copied them over without thinking about it.

I think those fields are not actually used unless you train MeCab with an HMM model, which would mean that they're never used in practice, but I can't find where I read this. I believe it was in Kudo's book (形態素解析の理論と実装) but I couldn't find it with a quick check of the index so I'll have to do some digging.

I will get back to this but it might take a while. Even if this is technically not correct and needs fixing, I think it has no effect on output, but if you encounter anything weird as a result of it let me know.

tyknkd · 2021-02-25T02:01:07Z

I might have misunderstood, but I believe that this page shows that the context IDs are used for parsing text. No?

polm · 2021-03-08T09:17:08Z

So looking at this in more detail, you are right.

What I was remembering is a section at the bottom of page 99 in Kudo's book where he mentions that having separate left and right contexts is redundant from a data structure perspective, and not necessary for CRFs. It has nothing to do with whether or not they are used in cost calculations.

Fortunately this doesn't seem to be having any negative effects, but I'll put a fix in. Thanks for pointing this out!

tyknkd · 2021-03-08T09:28:12Z

No problem! Thank you for providing this amazing resource!

polm · 2021-03-14T04:48:24Z

I'm testing the change on a Wikipedia dump and there are differences. Here's the first example I've found: 官報令和元年.

Before fix:

官報    名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
令和    名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,固,,,,,,,,レイワ,レイワ,レイワ,レイワ,1,0,,,,
元年    名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216

After fix:

官報    名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
令      接尾辞,名詞的,一般,,,,レイ,令,令,レー,令,レー,漢,,,,,,,接尾体,レイ,レイ,レイ,レイ,,C3,,11147407261835776,40554
和      名詞,普通名詞,一般,,,,ワ,和,和,ワ,和,ワ,漢,,,,,,,体,ワ,ワ,ワ,ワ,1,C3,,11298315232748032,41103
元年    名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216

Notice that the result is actually better before the fix, which is surprising.

Also note I confirmed this is not merely a case of the "fixed" dictionary accidentally not containing the Reiwa entries - it handles 令和元年 without issue.

I'll keep examining the differences.

tyknkd · 2021-03-14T07:29:06Z

Interesting! How did you determine the cost setting for 令和?
Could the reversal of the left and right context IDs described here also have an influence?

polm · 2021-03-14T13:05:51Z

The cost is modeled on the cost for 昭和.

I hadn't seen that question you link to but it looks like the issue is the same as the one described here. If I understand that correctly, the bug is in the size function, but not in the cost lookup, so it can cause dictionary building to fail but doesn't affect tokenization. So it's not relevant to this issue.

tyknkd · 2021-03-19T05:52:10Z

Interestingly, if you parse 官報昭和元年, 昭和 is not split like 令和:

官報,16116,17410,5130,名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
昭和,14771,15544,5952,名詞,固有名詞,地名,一般,,,ショウワ,ショウワ,昭和,ショーワ,昭和,ショーワ,固,,,,,,,地名,ショウワ,ショウワ,ショウワ,ショウワ,1,0,,,4644895495168512,16898
元年,16284,17783,2496,名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216

The 昭和 lemma parsed from 官報昭和元年 appears to be classed as a placename (地名) and the cost (5952) is different than the cost for 令和 in reiwa.33.csv (8205).

If you enter just 昭和 by itself, a different lemma is displayed, which again has a different cost (3179) than the one for 令和:

昭和,14625,15398,3179,名詞,固有名詞,一般,,,,ショウワ,昭和,昭和,ショーワ,昭和,ショーワ,固,,,,,,,固有名,ショウワ,ショウワ,ショウワ,ショウワ,0,1,,,4644620617261568,16897

Compare with 令和:

令和,14629,15402,8205,名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,固,,,,,,,,レイワ,レイワ,レイワ,レイワ,1,0,,,,

The costs for 令 and 和 separately are lower than for 令和:

令,17930,20045,6087,接尾辞,名詞的,一般,,,,レイ,令,令,レー,令,レー,漢,,,,,,,接尾体,レイ,レイ,レイ,レイ,,C3,,11147407261835776,40554
和,16118,17412,5213,名詞,普通名詞,一般,,,,ワ,和,和,ワ,和,ワ,漢,,,,,,,体,ワ,ワ,ワ,ワ,1,C3,,11298315232748032,41103

Did the cost for 昭和 which you used as the model for 令和 come from the UniDic 2.1.2 dictionary?
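To make the comparison concrete, here is a rough sketch of how a path cost is put together, assuming the usual MeCab scheme: the sum of word costs plus one connection cost per adjacent pair, looked up by the previous token's right ID and the next token's left ID. The word costs and IDs are the ones shown above. The connection cost in `MADE_UP_CONN` is invented purely for illustration (the real values live in matrix.def), and BOS/EOS connections are ignored.

```python
from typing import Dict, List, Tuple

# (surface, left_id, right_id, word_cost), taken from the dumps in this thread.
Token = Tuple[str, int, int, int]

KANPO: Token = ("官報", 16116, 17410, 5130)
REIWA: Token = ("令和", 14629, 15402, 8205)
REI: Token = ("令", 17930, 20045, 6087)
WA: Token = ("和", 16118, 17412, 5213)
GANNEN: Token = ("元年", 16284, 17783, 2496)


def path_cost(tokens: List[Token], conn: Dict[Tuple[int, int], int]) -> int:
    """Sum word costs plus connection costs between adjacent tokens.

    conn is keyed by (right_id of previous token, left_id of next token);
    pairs missing from conn are treated as cost 0 in this sketch.
    """
    total = sum(token[3] for token in tokens)
    for prev, nxt in zip(tokens, tokens[1:]):
        total += conn.get((prev[2], nxt[1]), 0)
    return total


whole = [KANPO, REIWA, GANNEN]
split = [KANPO, REI, WA, GANNEN]

# Word costs alone: the unsplit path is cheaper (15831 vs 18926).
print(path_cost(whole, {}), path_cost(split, {}))

# HYPOTHETICAL connection cost, not taken from matrix.def: a large enough
# penalty on 官報 -> 令和 is all it takes to make the split path win.
MADE_UP_CONN = {(17410, 14629): 4000}
print(path_cost(whole, MADE_UP_CONN), path_cost(split, MADE_UP_CONN))
```

Since 令 and 和 together cost more than 令和 on word costs alone (11300 vs 8205), the split can only come from the connection costs attached to the context IDs, which is why those IDs matter here.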

tyknkd · 2021-03-19T06:18:17Z

If I change the left/right context IDs and cost for 令和 to match those for the non-placename 昭和 lemma in UniDic 2.3.0, then 官報令和元年 parses as expected:

官報,16116,17410,5130,名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
令和,14625,15398,3179,名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,固,,,,,,,,レイワ,レイワ,レイワ,レイワ,0,1,,,,
元年,16284,17783,2496,名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216
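The edit itself is just a change to the three numeric columns after the surface form. A small sketch, assuming the column order is surface, left ID, right ID, cost as in the dumps above; `patch_entry` is a hypothetical helper, and the row is the 令和 line as printed earlier rather than the raw contents of reiwa.33.csv:

```python
import csv
import io

ROW = ("令和,14629,15402,8205,名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,"
       "固,,,,,,,,レイワ,レイワ,レイワ,レイワ,1,0,,,,")


def patch_entry(row: str, left_id: int, right_id: int, cost: int) -> str:
    """Return row with columns 1-3 (left ID, right ID, cost) replaced."""
    fields = next(csv.reader([row]))
    fields[1:4] = [str(left_id), str(right_id), str(cost)]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(fields)
    return out.getvalue()


# Values of the non-placename 昭和 lemma quoted earlier in this thread.
patched = patch_entry(ROW, 14625, 15398, 3179)
print(patched)
```

This only touches the IDs and the cost; every feature column is left as it was, and the user dictionary still has to be rebuilt afterwards for the change to take effect.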

polm · 2021-07-02T08:01:48Z

Thanks for the extra info on this, and sorry I haven't gotten around to dealing with it yet. I think there was public mention of the next UniDic release being in the works and I hoped that would happen soon enough I wouldn't have to investigate this.

I'll try to give this a proper look over soon.

polm · 2021-08-31T13:58:20Z

Just a followup on my last comment - v3.1.0 was released in April and I think with the last post I had forgotten about it for some reason. Since the new version is available I'm going to focus on releasing that and won't be investigating this further.

You can install the alpha release of 3.1.0 with python -m unidic download 3.1.0a1.

In the new version 令和 is included by default, so this is a non-issue. The only place I had to examine left/right IDs was in editing unk.def so that SYMBOL unks were punctuation; I was able to copy these values from the DEFAULT category, so they'll be correct this time.

polm added a commit that referenced this issue Mar 8, 2021

Fix left/right context ids in Reiwa file (see #8)

46eabf7

polm closed this as completed Aug 31, 2021

togiso mentioned this issue Apr 11, 2023

error on making user dictionary. taku910/mecab#10

Open
