left-id and right-id values in extras/reiwa.33.csv for UniDic 2.1.2 not 2.3.0? #8
Comments
Huh, thanks for pointing that out. I initially made the file for I think those fields are not actually used unless you train MeCab with an HMM model, which would mean that they're never used in practice, but I can't find where I read this. I believe it was in Kudo's book (形態素解析の理論と実装) but I couldn't find it with a quick check of the index so I'll have to do some digging. I will get back to this but it might take a while. Even if this is technically not correct and needs fixing, I think it has no effect on output, but if you encounter anything weird as a result of it let me know. |
I might have misunderstood, but I believe that this page shows that the context IDs are used for parsing text. No? |
So looking at this in more detail, you are right. What I was remembering is a section at the bottom of page 99 in Kudo's book where he mentions that having separate left and right contexts is redundant from a data structure perspective, and not necessary for CRFs. It has nothing to do with whether or not they are used in cost calculations. Fortunately this doesn't seem to be having any negative effects, but I'll put a fix in. Thanks for pointing this out! |
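For anyone following along, here is a minimal sketch of why the context IDs matter at parse time. It assumes the usual MeCab cost model, where a path's total cost is the sum of each word's own cost plus a connection cost looked up from the previous word's right ID and the next word's left ID. All IDs, costs, and names below (`Entry`, `CONNECTION`, `path_cost`) are made up for illustration and are not taken from UniDic or from MeCab's API.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    surface: str
    left_id: int   # context ID used when connecting to the previous word
    right_id: int  # context ID used when connecting to the next word
    cost: int      # word cost from the dictionary CSV


# Hypothetical connection matrix: (right_id of previous, left_id of next) -> cost.
# In a real dictionary this comes from matrix.def; these numbers are invented.
CONNECTION = {
    (0, 10): 100,
    (11, 20): -300,
    (21, 0): 50,
    (0, 30): 400,
    (31, 20): 900,
}

BOS_EOS_ID = 0  # assumed context ID for the sentence boundary


def path_cost(path):
    """Sum word costs and connection costs along one candidate path."""
    total = 0
    prev_right = BOS_EOS_ID
    for entry in path:
        total += CONNECTION[(prev_right, entry.left_id)] + entry.cost
        prev_right = entry.right_id
    # Connect the last word to the end-of-sentence boundary.
    total += CONNECTION[(prev_right, BOS_EOS_ID)]
    return total


# Same surface and word cost, but different context IDs on the first entry.
ids_a = [Entry("令和", 10, 11, 5000), Entry("元年", 20, 21, 3000)]
ids_b = [Entry("令和", 30, 31, 5000), Entry("元年", 20, 21, 3000)]

print(path_cost(ids_a), path_cost(ids_b))
```

In this toy setup the two paths get different totals even though the word costs are identical, so an entry carrying context IDs from a different dictionary version can change which segmentation wins.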
No problem! Thank you for providing this amazing resource! |
I'm testing the change on a Wikipedia dump and there are differences. Here's the first example I've found: Before fix:
After fix:
Notice that the result is actually better before the fix. That's surprising. Also note that I confirmed this is not merely a case of the "fixed" dictionary accidentally not containing the Reiwa entries: it handles 令和元年 without issue. I'll keep examining the differences. |
Interesting! How did you determine the cost setting for 令和? |
The cost is modeled on the cost for 昭和. I hadn't seen that question you link to but it looks like the issue is the same as the one described here. If I understand that correctly, the bug is in the size function, but not in the cost lookup, so it can cause dictionary building to fail but doesn't affect tokenization. So it's not relevant to this issue. |
Interestingly, if you parse
The If you enter just
Compare with
The costs for
Did the cost for |
If I change the left/right context IDs and cost for
|
Thanks for the extra info on this, and sorry I haven't gotten around to dealing with it yet. I think there was public mention of the next UniDic release being in the works, and I hoped that would happen soon enough that I wouldn't have to investigate this. I'll try to give this a proper look soon. |
Just a followup on my last comment - v3.1.0 was released in April and I think with the last post I had forgotten about it for some reason. Since the new version is available I'm going to focus on releasing that and won't be investigating this further. You can install the alpha release of 3.1.0 with In the new version 令和 is included by default so this is a non-issue. The only place I had to examine left/right ids was in editing |
Maybe I'm missing something, or maybe it's not important, but I just noticed that the left and right context ID values in the extras/reiwa.33.csv seem to be for UniDic 2.1.2.
If I'm not mistaken the corresponding values from left-id.def and right-id.def for UniDic CJW 2.3.0 should be: