[distsim] BAP generates almost no entries for German, CONLL data. #288

Open
gilnoh opened this issue Nov 6, 2013 · 11 comments
@gilnoh
Member

gilnoh commented Nov 6, 2013

(This reports an issue with distsim running on a German corpus.)

(Note: the BAP-GER configuration requires a minor update to its redis-model paths. Some paths in the configuration were out of date, and I had to fix them (/bap/ to /bap-ger/) before it would run. This was not a problem.)

However, even after fixing those paths, it generates only a very small Redis DB.
Even though BAP-GER model generation and the Redis copy both finished, it seems that something went amiss and BAP has almost no entries.

The data in /models/bap-ger/ are very small. For example, elements-similarities-left-bap is only 9560 bytes and elements-similarities-right-bap only 119656 bytes, far smaller than the roughly 600 MB of the lin data.

Something is not okay here. Contents of the generated "models" directory of "bap-ger":
-rw-r--r-- 1 noh mitarb 2887872457 Nov 4 17:50 cooccurrences
-rw-r--r-- 1 noh mitarb 48872100 Nov 4 17:57 element-feature-counts
-rw-r--r-- 1 noh mitarb 87471890 Nov 4 17:58 element-feature-scores-pmi
-rw-r--r-- 1 noh mitarb 21785161 Nov 4 17:57 elements
-rw-r--r-- 1 noh mitarb 556338 Nov 4 17:58 element-scores-pmi
-rw-r--r-- 1 noh mitarb 525029405 Nov 4 20:24 elements-similarities-left-apinc
-rw-r--r-- 1 noh mitarb 9560 Nov 4 20:25 elements-similarities-left-bap
-rw-r--r-- 1 noh mitarb 597556654 Nov 4 18:29 elements-similarities-left-lin
-rw-r--r-- 1 noh mitarb 504753695 Nov 4 20:24 elements-similarities-right-apinc
-rw-r--r-- 1 noh mitarb 119656 Nov 4 20:24 elements-similarities-right-bap
-rw-r--r-- 1 noh mitarb 597556654 Nov 4 18:29 elements-similarities-right-lin
-rw-r--r-- 1 noh mitarb 27719715 Nov 4 17:57 feature-elements
-rw-r--r-- 1 noh mitarb 1021943559 Nov 4 17:57 features
-rw-r--r-- 1 noh mitarb 692765623 Nov 4 17:50 textunits

The resulting Redis DB is also far too small (compared to others such as LIN):
-rw-r--r-- 1 noh mitarb 9850 Nov 5 12:23 similarity-l2r.rdb
-rw-r--r-- 1 noh mitarb 146988 Nov 5 12:23 similarity-r2l.rdb

(For comparison, LIN-prox generated from the same corpus:)
-rw-r--r-- 1 noh mitarb 803511284 Oct 31 12:20 similarity-l2r.rdb
-rw-r--r-- 1 noh mitarb 786105200 Oct 31 12:35 similarity-r2l.rdb
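
A check like this can be automated so that an almost-empty similarity file is caught before the Redis copy. The following is a minimal sketch, not part of distsim: it uses the sizes from the listing above, and the function name `suspiciously_small` and the 1% threshold are assumptions chosen only for illustration.

```python
# Sizes in bytes, copied from the "models" listing above.
SIZES = {
    "elements-similarities-left-apinc": 525029405,
    "elements-similarities-left-bap": 9560,
    "elements-similarities-left-lin": 597556654,
    "elements-similarities-right-apinc": 504753695,
    "elements-similarities-right-bap": 119656,
    "elements-similarities-right-lin": 597556654,
}


def suspiciously_small(sizes, reference="lin", ratio=0.01):
    """Return files smaller than `ratio` times the reference-measure file
    of the same direction (left/right). The threshold is an assumption."""
    flagged = []
    for name, size in sorted(sizes.items()):
        # Split "elements-similarities-left-bap" into prefix and measure.
        prefix, _, measure = name.rpartition("-")
        if measure == reference:
            continue
        ref_size = sizes.get(f"{prefix}-{reference}")
        if ref_size and size < ratio * ref_size:
            flagged.append(name)
    return flagged


print(suspiciously_small(SIZES))
```

With the sizes above, only the two bap files fall under the threshold; the apinc files are within the same order of magnitude as lin.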

@ghost ghost assigned adlerm Nov 6, 2013
@gilnoh
Member Author

gilnoh commented Nov 6, 2013

You can reproduce this with the following intermediate-size corpus (1/30th of SDEWAC):
http://www.cl.uni-heidelberg.de/~noh/sdewac_part01.mstparsed.utf8.conll.gz

@adlerm
Contributor

adlerm commented Nov 12, 2013

Comment out the vector-truncate module in the element-feature-scoring-memory-pmi.xml configuration, then run the BAP generation again.

@gilnoh
Member Author

gilnoh commented Nov 12, 2013

Thanks! I will try this out and let you know the result ASAP. (Our server is currently occupied by someone else, so I have to wait a bit!)

@adlerm
Contributor

adlerm commented Nov 13, 2013

In addition, ensure that the min-count feature of the 'lemma-pos-extraction' module has the same value in both the lin and bap element-feature-counting-memory.xml configuration files.
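
One way to verify this without opening both files by hand is to compare the lines that mention min-count. The sketch below is only an illustration: it makes no assumption about the XML schema of the distsim configuration and simply scans the text, so it compares every line containing "min-count", not just the one inside 'lemma-pos-extraction'. The function names and the file paths in the usage comment are hypothetical.

```python
from pathlib import Path


def lines_mentioning(path, needle="min-count"):
    """Return the stripped lines of a text file that contain `needle`."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if needle in line]


def same_setting(lin_config, bap_config, needle="min-count"):
    """True when both files have identical, non-empty lines mentioning `needle`."""
    lin_lines = lines_mentioning(lin_config, needle)
    return bool(lin_lines) and lin_lines == lines_mentioning(bap_config, needle)


# Hypothetical usage; adjust the paths to your own configuration directories:
# same_setting("lin/element-feature-counting-memory.xml",
#              "bap/element-feature-counting-memory.xml")
```

If the function returns False, printing the output of `lines_mentioning` for each file shows which values differ.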

@gilnoh
Copy link
Member Author

gilnoh commented Nov 13, 2013

> In addition, ensure that the min-count feature of the 'lemma-pos-extraction' module has the same value in both the lin and bap element-feature-counting-memory.xml configuration files.

They were always the same (both 10) and never changed, so I am pretty sure this is not relevant.

@gilnoh
Member Author

gilnoh commented Nov 13, 2013

I just finished one test with the following configuration change:

> Comment out the vector-truncate module in the element-feature-scoring-memory-pmi.xml configuration, then run the BAP generation again.

However, the result was no better. Here are the sizes of the generated model files:
/demo/models/bap-ger
-rw-r--r-- 1 noh mitarb 2887872457 Nov 13 10:58 cooccurrences
-rw-r--r-- 1 noh mitarb 48872100 Nov 13 11:04 element-feature-counts
-rw-r--r-- 1 noh mitarb 96371157 Nov 13 11:05 element-feature-scores-pmi
-rw-r--r-- 1 noh mitarb 21785161 Nov 13 11:03 elements
-rw-r--r-- 1 noh mitarb 556354 Nov 13 11:05 element-scores-pmi
-rw-r--r-- 1 noh mitarb 524316385 Nov 13 15:51 elements-similarities-left-apinc
-rw-r--r-- 1 noh mitarb 4145 Nov 13 15:52 elements-similarities-left-bap
-rw-r--r-- 1 noh mitarb 597518405 Nov 13 11:31 elements-similarities-left-lin
-rw-r--r-- 1 noh mitarb 504947377 Nov 13 15:51 elements-similarities-right-apinc
-rw-r--r-- 1 noh mitarb 34486 Nov 13 15:51 elements-similarities-right-bap
-rw-r--r-- 1 noh mitarb 597518405 Nov 13 11:31 elements-similarities-right-lin
-rw-r--r-- 1 noh mitarb 27719715 Nov 13 11:04 feature-elements
-rw-r--r-- 1 noh mitarb 1021943559 Nov 13 11:04 features
-rw-r--r-- 1 noh mitarb 692765623 Nov 13 10:55 textunits

As you can see, elements-similarities-left-bap and elements-similarities-right-bap are extremely small. Something is definitely wrong: I suspect this is not just a configuration problem but a bug.

@adlerm
Contributor

adlerm commented Nov 13, 2013

Very strange. This is what I got with the same settings (on this db:
http://www.cl.uni-heidelberg.de/~noh/sdewac_part01.mstparsed.utf8.conll.gz):

[adlerm6@te-srv2 ~]$ ll workspace/db/directional/
total 9252980
-rw-r--r--. 1 adlerm6 ir 2887103643 Nov 11 15:49 cooccurrences
-rw-r--r--. 1 adlerm6 ir 48856113 Nov 13 01:15 element-feature-counts
-rw-r--r--. 1 adlerm6 ir 87441253 Nov 13 07:42 element-feature-scores-pmi
-rw-r--r--. 1 adlerm6 ir 555841 Nov 13 07:42 element-scores-pmi
-rw-r--r--. 1 adlerm6 ir 21792311 Nov 13 01:15 elements
-rw-r--r--. 1 adlerm6 ir 524563348 Nov 13 09:54 elements-similarities-left-apinc
-rw-r--r--. 1 adlerm6 ir 524563348 Nov 13 09:55 elements-similarities-left-apinc.sorted
-rw-r--r--. 1 adlerm6 ir 50956965 Nov 13 09:56 elements-similarities-left-bap
-rw-r--r--. 1 adlerm6 ir 596987645 Nov 13 08:06 elements-similarities-left-lin
-rw-r--r--. 1 adlerm6 ir 596987645 Nov 13 09:55 elements-similarities-left-lin.sorted
-rw-r--r--. 1 adlerm6 ir 504357334 Nov 13 09:54 elements-similarities-right-apinc
-rw-r--r--. 1 adlerm6 ir 504357334 Nov 13 09:54 elements-similarities-right-apinc.sorted
-rw-r--r--. 1 adlerm6 ir 189246395 Nov 13 09:55 elements-similarities-right-bap
-rw-r--r--. 1 adlerm6 ir 596987645 Nov 13 08:06 elements-similarities-right-lin
-rw-r--r--. 1 adlerm6 ir 596987645 Nov 13 09:54 elements-similarities-right-lin.sorted
-rw-r--r--. 1 adlerm6 ir 27706587 Nov 13 01:15 feature-elements
-rw-r--r--. 1 adlerm6 ir 1022359819 Nov 13 01:15 features
-rw-r--r--. 1 adlerm6 ir 693160383 Nov 11 15:49 textunits
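
To compare two runs like these file by file, the `ls -l` output can be parsed and the sizes divided. This is a rough sketch under the assumption that each entry is on one line with the usual nine whitespace-separated fields and no spaces in the file name; the function names are made up for illustration, and the sample lines are taken from the two listings in this thread.

```python
def parse_sizes(listing):
    """Map file name -> size in bytes from `ls -l` style lines."""
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        # Expected fields: permissions, links, owner, group, size,
        # month, day, time, name. Other lines (e.g. "total ...") are skipped.
        if len(fields) == 9 and fields[0].startswith("-") and fields[4].isdigit():
            sizes[fields[8]] = int(fields[4])
    return sizes


def size_ratios(run_a, run_b):
    """For files present in both runs, return size in run_a / size in run_b."""
    common = sorted(run_a.keys() & run_b.keys())
    return {name: run_a[name] / run_b[name] for name in common if run_b[name]}


RUN_NOH = """\
-rw-r--r-- 1 noh mitarb 4145 Nov 13 15:52 elements-similarities-left-bap
-rw-r--r-- 1 noh mitarb 597518405 Nov 13 11:31 elements-similarities-left-lin
"""

RUN_ADLERM = """\
total 9252980
-rw-r--r--. 1 adlerm6 ir 50956965 Nov 13 09:56 elements-similarities-left-bap
-rw-r--r--. 1 adlerm6 ir 596987645 Nov 13 08:06 elements-similarities-left-lin
"""

ratios = size_ratios(parse_sizes(RUN_NOH), parse_sizes(RUN_ADLERM))
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.6f}")
```

A ratio close to 1 (as for the lin file) means the two runs agree; a ratio several orders of magnitude below 1 (as for the bap file) points at the step where the runs diverge.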


@adlerm
Contributor

adlerm commented Nov 13, 2013

It should not matter, but did you use the new version of distsim or the previous one? (I used the new one.)


@adlerm
Contributor

adlerm commented Nov 13, 2013

OK - got it

The following feature should be added to element-similarity-combiner in both element-similarity-combination-left.xml and element-similarity-combination-right.xml:

false

You can restore the vector-truncate module in the element-feature-scoring-memory-pmi.xml configuration (it will be more efficient).

I'm updating the new distsim demo with the corrected configuration files. I'll let you know so you can re-download and use it.

I'm sorry for this - thanks!

@adlerm
Contributor

adlerm commented Nov 13, 2013

The demo package now contains the fixed configuration files:
http://hlt-services4.fbk.eu:8080/artifactory/simple/private-internal/BIU/exci-dist-sim/4

@gilnoh
Member Author

gilnoh commented Nov 13, 2013

Meni, thanks a lot for debugging and resolving this issue! I will download the new version, regenerate BAP, and close this issue. Nice work!
