left-id and right-id values in extras/reiwa.33.csv for UniDic 2.1.2 not 2.3.0? #8

Closed
tyknkd opened this issue Feb 22, 2021 · 11 comments

Comments

@tyknkd

tyknkd commented Feb 22, 2021

Maybe I'm missing something, or maybe it's not important, but I just noticed that the left and right context ID values in extras/reiwa.33.csv seem to be for UniDic 2.1.2.

If I'm not mistaken the corresponding values from left-id.def and right-id.def for UniDic CJW 2.3.0 should be:

left-id.def: 14629 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*
right-id.def: 15402 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*

left-id.def: 18255 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*
right-id.def: 20453 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*
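
For anyone who wants to check these values locally, here is a minimal sketch of the lookup. It assumes each line of left-id.def and right-id.def has the form "<id> <feature string>", as in the lines quoted above. The function name load_context_ids and the inline sample lists are made up for illustration; in practice you would read the lines from the real files.

```python
def load_context_ids(lines):
    """Map each feature string to its numeric context ID.

    Assumes every non-empty line looks like '<id> <feature string>'.
    """
    ids = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first space; the feature string itself has none here.
        context_id, feature = line.split(" ", 1)
        ids[feature] = int(context_id)
    return ids


# Sample lines copied from the values quoted in this comment.
LEFT_ID_SAMPLE = [
    '14629 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*',
    "18255 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*",
]
RIGHT_ID_SAMPLE = [
    '15402 名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*',
    "20453 補助記号,一般,*,*,*,*,*,*,記号,*,*,*,*,*,*",
]

left_ids = load_context_ids(LEFT_ID_SAMPLE)
right_ids = load_context_ids(RIGHT_ID_SAMPLE)

feature = '名詞,固有名詞,一般,*,*,*,*,*,固,*,*,*,"1,0",*,*'
print(left_ids[feature], right_ids[feature])
```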
@polm
Owner

polm commented Feb 24, 2021

Huh, thanks for pointing that out. I initially made the file for unidic-lite and must have just copied them over without thinking about it.

I think those fields are not actually used unless you train MeCab with an HMM model, which would mean that they're never used in practice, but I can't find where I read this. I believe it was in Kudo's book (形態素解析の理論と実装) but I couldn't find it with a quick check of the index so I'll have to do some digging.

I will get back to this but it might take a while. Even if this is technically not correct and needs fixing, I think it has no effect on output, but if you encounter anything weird as a result of it let me know.

@tyknkd
Author

tyknkd commented Feb 25, 2021

I might have misunderstood, but I believe that this page shows that the context IDs are used for parsing text. No?

@polm
Owner

polm commented Mar 8, 2021

So looking at this in more detail, you are right.

What I was remembering is a section at the bottom of page 99 in Kudo's book where he mentions that having separate left and right contexts is redundant from a data structure perspective, and not necessary for CRFs. It has nothing to do with whether or not they are used in cost calculations.

Fortunately this doesn't seem to be having any negative effects, but I'll put a fix in. Thanks for pointing this out!
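
To make the role of the context IDs concrete, here is a toy sketch of how a path cost is put together: the word costs are summed, and for each adjacent pair a connection cost is looked up by the right ID of the preceding token and the left ID of the following one. All IDs and costs below are invented for illustration and do not come from any UniDic release, and BOS/EOS handling is left out. The point is only that changing an entry's context IDs can change which segmentation has the lowest total.

```python
def path_cost(path, connection_cost):
    """Sum word costs plus the connection cost between each adjacent pair.

    path: list of (surface, left_id, right_id, word_cost) tuples.
    connection_cost: dict keyed by (right_id of previous, left_id of next).
    """
    total = 0
    previous = None
    for token in path:
        _surface, left_id, _right_id, word_cost = token
        total += word_cost
        if previous is not None:
            total += connection_cost[(previous[2], left_id)]
        previous = token
    return total


# Hypothetical connection costs (not real matrix.def values).
CONNECTION = {
    (1, 2): 100, (2, 5): 100,    # X -> AB -> Y with IDs 2/2
    (1, 3): -300, (3, 4): -200,  # X -> A -> B
    (4, 5): 100,                 # B -> Y
    (1, 6): -500, (6, 5): -400,  # X -> AB -> Y with IDs 6/6
}

x = ("X", 1, 1, 500)
y = ("Y", 5, 5, 200)
ab = ("AB", 2, 2, 800)      # one-token reading
ab_alt = ("AB", 6, 6, 800)  # same word cost, different context IDs
a = ("A", 3, 3, 600)
b = ("B", 4, 4, 500)

whole = path_cost([x, ab, y], CONNECTION)
split = path_cost([x, a, b, y], CONNECTION)
whole_alt = path_cost([x, ab_alt, y], CONNECTION)
print(whole, split, whole_alt)
```

With these made-up numbers the split path is cheaper than the one-token path when AB uses IDs 2/2, but the one-token path becomes the cheapest once AB uses IDs 6/6, even though its word cost is unchanged.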

@tyknkd
Author

tyknkd commented Mar 8, 2021

No problem! Thank you for providing this amazing resource!

@polm
Owner

polm commented Mar 14, 2021

I'm testing the change on a Wikipedia dump and there are differences. Here's the first example I've found: 官報令和元年.

Before fix:

官報    名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
令和    名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,固,,,,,,,,レイワ,レイワ,レイワ,レイワ,1,0,,,,
元年    名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216

After fix:

官報    名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
令      接尾辞,名詞的,一般,,,,レイ,令,令,レー,令,レー,漢,,,,,,,接尾体,レイ,レイ,レイ,レイ,,C3,,11147407261835776,40554
和      名詞,普通名詞,一般,,,,ワ,和,和,ワ,和,ワ,漢,,,,,,,体,ワ,ワ,ワ,ワ,1,C3,,11298315232748032,41103
元年    名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216

Notice that the result is actually better before the fix. That's surprising.

Also note I confirmed this is not merely a case of the "fixed" dictionary accidentally not containing the Reiwa entries - it handles 令和元年 without issue.

I'll keep examining the differences.

@tyknkd
Author

tyknkd commented Mar 14, 2021

Interesting! How did you determine the cost setting for 令和?
Could the reversal of the left and right context IDs described here also have an influence?

@polm
Owner

polm commented Mar 14, 2021

The cost is modeled on the cost for 昭和.

I hadn't seen that question you link to but it looks like the issue is the same as the one described here. If I understand that correctly, the bug is in the size function, but not in the cost lookup, so it can cause dictionary building to fail but doesn't affect tokenization. So it's not relevant to this issue.

@tyknkd
Author

tyknkd commented Mar 19, 2021

Interestingly, if you parse 官報昭和元年, 昭和 is not split like 令和:

官報,16116,17410,5130,名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
昭和,14771,15544,5952,名詞,固有名詞,地名,一般,,,ショウワ,ショウワ,昭和,ショーワ,昭和,ショーワ,固,,,,,,,地名,ショウワ,ショウワ,ショウワ,ショウワ,1,0,,,4644895495168512,16898
元年,16284,17783,2496,名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216

The 昭和 lemma parsed from 官報昭和元年 appears to be classed as a placename (地名) and the cost (5952) is different than the cost for 令和 in reiwa.33.csv (8205).

If you enter just 昭和 by itself, a different lemma is displayed, which again has a different cost (3179) than the one for 令和:

昭和,14625,15398,3179,名詞,固有名詞,一般,,,,ショウワ,昭和,昭和,ショーワ,昭和,ショーワ,固,,,,,,,固有名,ショウワ,ショウワ,ショウワ,ショウワ,0,1,,,4644620617261568,16897

Compare with 令和:

令和,14629,15402,8205,名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,固,,,,,,,,レイワ,レイワ,レイワ,レイワ,1,0,,,,

The costs for 令 and 和 separately are lower than for 令和:

令,17930,20045,6087,接尾辞,名詞的,一般,,,,レイ,令,令,レー,令,レー,漢,,,,,,,接尾体,レイ,レイ,レイ,レイ,,C3,,11147407261835776,40554
和,16118,17412,5213,名詞,普通名詞,一般,,,,ワ,和,和,ワ,和,ワ,漢,,,,,,,体,ワ,ワ,ワ,ワ,1,C3,,11298315232748032,41103

Did the cost for 昭和 which you used as the model for 令和 come from the UniDic 2.1.2 dictionary?

@tyknkd
Author

tyknkd commented Mar 19, 2021

If I change the left/right context IDs and cost for 令和 to match those for the non-placename 昭和 lemma in UniDic 2.3.0, then 官報令和元年 parses as expected:

官報,16116,17410,5130,名詞,普通名詞,一般,,,,カンポウ,官報,官報,カンポー,官報,カンポー,漢,,,,,,,体,カンポウ,カンポウ,カンポウ,カンポウ,1,C1,,13578152592941568,49397
令和,14625,15398,3179,名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,固,,,,,,,,レイワ,レイワ,レイワ,レイワ,0,1,,,,
元年,16284,17783,2496,名詞,普通名詞,副詞可能,,,,ガンネン,元年,元年,ガンネン,元年,ガンネン,漢,,,,,,,体,ガンネン,ガンネン,ガンネン,ガンネン,1,C1,,2258405507080704,8216
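
In case it helps, this is roughly how the change can be scripted. It is a minimal sketch that assumes the entry is a plain comma-separated row whose second to fourth columns are left ID, right ID, and cost, as in the rows shown above. The helper name set_ids_and_cost is made up, and it only touches those three columns; the dictionary still has to be rebuilt afterwards.

```python
import csv
import io


def set_ids_and_cost(row_text, left_id, right_id, cost):
    """Return the CSV row with its left ID, right ID, and cost columns replaced."""
    fields = next(csv.reader([row_text]))
    # Columns: 0 = surface, 1 = left ID, 2 = right ID, 3 = cost, 4+ = features.
    fields[1:4] = [str(left_id), str(right_id), str(cost)]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(fields)
    return out.getvalue()


# The 令和 row as quoted earlier in this thread.
original = (
    "令和,14629,15402,8205,名詞,固有名詞,一般,,,,レイワ,令和,令和,レーワ,令和,レーワ,"
    "固,,,,,,,,レイワ,レイワ,レイワ,レイワ,1,0,,,,"
)

# IDs and cost of the non-placename 昭和 lemma quoted earlier in this thread.
patched = set_ids_and_cost(original, 14625, 15398, 3179)
print(patched)
```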

@polm
Owner

polm commented Jul 2, 2021

Thanks for the extra info on this, and sorry I haven't gotten around to dealing with it yet. I think there was public mention of the next UniDic release being in the works and I hoped that would happen soon enough I wouldn't have to investigate this.

I'll try to give this a proper look over soon.

@polm
Owner

polm commented Aug 31, 2021

Just a followup on my last comment - v3.1.0 was released in April and I think with the last post I had forgotten about it for some reason. Since the new version is available I'm going to focus on releasing that and won't be investigating this further.

You can install the alpha release of 3.1.0 with python -m unidic download 3.1.0a1.

In the new version 令和 is included by default so this is a non-issue. The only place I had to examine left/right ids was in editing unk.def so that SYMBOL unks were punctuation; I was able to copy these values from the DEFAULT category so they'll be correct this time.
