
[distsim] DIRT extraction --- very hard due to extreme memory usage. (German CONLL) #295

Open
gilnoh opened this issue Nov 6, 2013 · 2 comments
gilnoh commented Nov 6, 2013

When tested with an intermediate-size corpus (roughly BNC size, about 1/30 of DEWAC):

It required 67 GB of virtual memory, although it only used 30 GB of real memory.
In theory, this should be okay. But it actually crashed our server with 50 GB of memory --- due to other users who were running parsers, etc. (The largest server we have has 50 GB of memory and 20 GB of swap, so running DIRT can crash the server.) Note that we used only 1/30 of the DEWAC corpus.

We will try to run this again, but we have some serious questions.

Q: How can we reduce the memory footprint?

  • Would it help to raise the min count?
  • Would it help to add stopwords for German? (Note that the German stopword files are currently empty.)
  • When it crashed, the Y-elements model file had already grown to around 10 GB. IMO this is too big even for Redis ... is it not?
  • Any other hints?
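To make the first two questions concrete, here is a minimal sketch of how min-count and stopword filtering would shrink the element vocabulary before it reaches the model file or Redis. This is an illustration only: `MIN_COUNT`, `GERMAN_STOPWORDS`, and `prune_vocabulary` are hypothetical names, not actual distsim configuration parameters.

```python
from collections import Counter

# Hypothetical settings, purely for illustration.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "in", "zu", "den"}
MIN_COUNT = 5

def prune_vocabulary(tokens, min_count=MIN_COUNT, stopwords=GERMAN_STOPWORDS):
    """Count tokens, then drop stopwords and tokens seen fewer
    than min_count times; returns the surviving token counts."""
    counts = Counter(t.lower() for t in tokens)
    return {t: c for t, c in counts.items()
            if c >= min_count and t not in stopwords}
```

Since rare elements typically dominate the vocabulary (Zipf's law), even a modest min-count cut can remove the long tail of entries, which is where most of the model-file growth would come from.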
gilnoh commented Nov 6, 2013

You can reproduce this memory requirement with the following intermediate-size corpus (1/30th of SDEWAC):
http://www.cl.uni-heidelberg.de/~noh/sdewac_part01.mstparsed.utf8.conll.gz


gilnoh commented Nov 13, 2013

Ah, I see that for English, Meni used 2 CDs (about 1 GB?) with 64 GB of memory. So I think the 1 GB SDEWAC 1/30 parsed corpus would take about the same amount of memory ... hmm. Should I try a smaller corpus? Are there any parameters suited to an automatically parsed corpus (which may contain many parse errors)? --- What I hope is to cover more of the corpus, with a higher threshold or something like that. Any advice would be nice!
