[distsim] BAP generates almost no entries for German, CONLL data. #288
Comments
You can reproduce this with the following intermediate-size corpus (1/30th of SDEWAC):
Comment out the vector-truncate module in the element-feature-scoring-memory-pmi.xml configuration, then run the BAP generation again.
Thanks! I will try this out and let you know the result ASAP. (Our server is currently occupied by someone else, so I have to wait a bit!)
In addition, make sure that the min-count feature of the 'lemma-pos-extraction' module has the same value in both the lin and the bap element-feature-counting-memory.xml configuration files.
They were always the same (both 10) and never changed, so I am pretty sure this is not relevant.
I just finished one test with the following new configuration change.
However, it did not give better results. Here are the sizes of the generated model files. As you can see, elements-similarities-left-bap and elements-similarities-right-bap are extremely small. Something is definitely wrong, and I suspect it is not just a configuration problem but a bug.
Very strange. This is what I got for the same setting (on this db):
[adlerm6@te-srv2 ~]$ ll workspace/db/directional/
(In reply to Tae-Gil Noh, Wed, Nov 13, 2013 at 4:59 PM.)
It should not matter, but did you use the new version of distsim or the ...
(In reply to Meni Adler, Wed, Nov 13, 2013 at 5:29 PM.)
OK, got it. The following feature should be added to element-similarity-combiner in both element-similarity-combination-left.xml and element-similarity-combination-right.xml: false
You can restore the vector-truncate module in the element-feature-scoring-memory-pmi.xml configuration (it will be more efficient). I'm updating the configuration files in the new distsim demo with the fix. I'll let you know, so you can re-download and use it. Sorry about this, and thanks!
The demo package now contains the fixed configuration files.
Meni, thanks a lot for debugging and resolving this issue! I will download the new version, regenerate BAP, and close this issue. Nice work!
(This reports an issue with distsim running on a German corpus.)
(Note: the BAP-GER configuration requires a minor update for its redis-model path. Some paths in the configuration were out of date, and I had to fix them (/bap/ to /bap-ger/) before it would run. This was not a problem.)
However, even after fixing those paths, it generates only a very small Redis DB.
Even though BAP-GER model generation and the Redis copy both finished, something seems to have gone amiss, and BAP has almost no entries.
The data in /models/bap-ger/ is very small. For example, elements-similarities-left-bap is only 9560 bytes and elements-similarities-right-bap only 119656 bytes, far from the roughly 600 MB of the lin data.
Something is not right here. Contents of the generated "models" directory of "bap-ger":
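For anyone hitting the same stale paths, here is a minimal Python sketch of the /bap/ to /bap-ger/ fix. It treats a configuration file as plain text, which is an assumption; the sample lines and key names below are hypothetical and are not taken from the real distsim configuration.

```python
def stale_path_lines(config_text, stale="/bap/"):
    """Return (line_number, line) pairs that still contain the stale path segment."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), 1)
        if stale in line
    ]


def fix_paths(config_text, stale="/bap/", fresh="/bap-ger/"):
    """Replace the stale segment; '/bap-ger/' does not contain '/bap/', so rerunning is safe."""
    return config_text.replace(stale, fresh)


# Hypothetical configuration content, for illustration only.
sample = (
    "redis-file = /models/bap/similarity-l2r.rdb\n"
    "other-file = /models/bap-ger/similarity-r2l.rdb\n"
)

for number, line in stale_path_lines(sample):
    print(f"line {number}: {line}")

fixed = fix_paths(sample)
```

Listing the stale lines first makes it easy to review what would change before rewriting any file.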
-rw-r--r-- 1 noh mitarb 2887872457 Nov 4 17:50 cooccurrences
-rw-r--r-- 1 noh mitarb 48872100 Nov 4 17:57 element-feature-counts
-rw-r--r-- 1 noh mitarb 87471890 Nov 4 17:58 element-feature-scores-pmi
-rw-r--r-- 1 noh mitarb 21785161 Nov 4 17:57 elements
-rw-r--r-- 1 noh mitarb 556338 Nov 4 17:58 element-scores-pmi
-rw-r--r-- 1 noh mitarb 525029405 Nov 4 20:24 elements-similarities-left-apinc
-rw-r--r-- 1 noh mitarb 9560 Nov 4 20:25 elements-similarities-left-bap
-rw-r--r-- 1 noh mitarb 597556654 Nov 4 18:29 elements-similarities-left-lin
-rw-r--r-- 1 noh mitarb 504753695 Nov 4 20:24 elements-similarities-right-apinc
-rw-r--r-- 1 noh mitarb 119656 Nov 4 20:24 elements-similarities-right-bap
-rw-r--r-- 1 noh mitarb 597556654 Nov 4 18:29 elements-similarities-right-lin
-rw-r--r-- 1 noh mitarb 27719715 Nov 4 17:57 feature-elements
-rw-r--r-- 1 noh mitarb 1021943559 Nov 4 17:57 features
-rw-r--r-- 1 noh mitarb 692765623 Nov 4 17:50 textunits
The resulting Redis DB is also far too small compared to others such as LIN:
-rw-r--r-- 1 noh mitarb 9850 Nov 5 12:23 similarity-l2r.rdb
-rw-r--r-- 1 noh mitarb 146988 Nov 5 12:23 similarity-r2l.rdb
(For comparison, LIN-prox generated from the same corpus:)
-rw-r--r-- 1 noh mitarb 803511284 Oct 31 12:20 similarity-l2r.rdb
-rw-r--r-- 1 noh mitarb 786105200 Oct 31 12:35 similarity-r2l.rdb
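The size comparison above can be automated. Below is a rough Python sketch that parses `ls -l` style lines and flags any `-bap` similarity file that is much smaller than its `-lin` counterpart. The function names and the 1% threshold are my own assumptions, not part of distsim; the four listing lines are copied from the directory listing in this report.

```python
LISTING = """\
-rw-r--r-- 1 noh mitarb 9560 Nov 4 20:25 elements-similarities-left-bap
-rw-r--r-- 1 noh mitarb 597556654 Nov 4 18:29 elements-similarities-left-lin
-rw-r--r-- 1 noh mitarb 119656 Nov 4 20:24 elements-similarities-right-bap
-rw-r--r-- 1 noh mitarb 597556654 Nov 4 18:29 elements-similarities-right-lin
"""


def parse_sizes(listing):
    """Map file name -> size in bytes from `ls -l` style lines."""
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        # Expect: mode, links, owner, group, size, month, day, time, name.
        if len(fields) < 9 or not fields[4].isdigit():
            continue
        sizes[fields[-1]] = int(fields[4])
    return sizes


def undersized(sizes, measure="bap", baseline="lin", min_ratio=0.01):
    """Return (name, size, baseline_size) for files far smaller than the baseline.

    min_ratio is an arbitrary threshold chosen for illustration.
    """
    flagged = []
    for name, size in sorted(sizes.items()):
        if not name.endswith("-" + measure):
            continue
        reference = name[: -len(measure)] + baseline
        if reference in sizes and size < min_ratio * sizes[reference]:
            flagged.append((name, size, sizes[reference]))
    return flagged


flagged = undersized(parse_sizes(LISTING))
for name, size, reference_size in flagged:
    print(f"{name}: {size} bytes vs {reference_size} bytes for lin")
```

On the listing above, both BAP similarity files are flagged, which matches the symptom described in this report. A check like this only signals that something is off; it does not say why.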