
[distsim] DIRT extraction --- very hard due to extreme memory usage. (German CONLL) #295

Open
gilnoh opened this issue Nov 6, 2013 · 2 comments
gilnoh commented Nov 6, 2013

When tested with an intermediate-size corpus (roughly BNC size, about 1/30 of DEWAC):

It required 67 GB of virtual memory, although it only used 30 GB of real memory.
In theory, this should be okay. But it actually crashed our server with 50 GB of memory --- due to other users who were running parsers, etc. (The largest server we have has 50 GB of memory and 20 GB of swap, so running DIRT can crash the server.) Note that we used only 1/30 of the DEWAC corpus.

We will try to run this again, but we have some serious questions.

Q: How can we reduce the memory footprint?

  • Would it help to raise the min count?
  • Would it help to add stopwords for German? (Note that the German stopword files are currently empty.)
  • When it crashed, the Y-elements model file had already grown to around 10 GB. IMO this is too big even for Redis ... is it not?
  • Any other hints?
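To make the first two questions concrete, here is a minimal sketch of how min-count and stopword filtering would shrink the element vocabulary before it reaches the model file or Redis. This is an illustration only: `MIN_COUNT`, `GERMAN_STOPWORDS`, and `prune_vocabulary` are hypothetical names, not actual distsim configuration parameters.

```python
from collections import Counter

# Hypothetical settings, purely for illustration.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "in", "zu", "den"}
MIN_COUNT = 5

def prune_vocabulary(tokens, min_count=MIN_COUNT, stopwords=GERMAN_STOPWORDS):
    """Count tokens, then drop stopwords and tokens seen fewer
    than min_count times; returns the surviving token counts."""
    counts = Counter(t.lower() for t in tokens)
    return {t: c for t, c in counts.items()
            if c >= min_count and t not in stopwords}
```

Since rare elements typically dominate the vocabulary (Zipf's law), even a modest min-count cut can remove the long tail of entries, which is where most of the model-file growth would come from.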
gilnoh commented Nov 6, 2013

You can reproduce this memory requirement with the following intermediate-size corpus (1/30th of SDEWAC):
http://www.cl.uni-heidelberg.de/~noh/sdewac_part01.mstparsed.utf8.conll.gz


gilnoh commented Nov 13, 2013

Ah, I see that for English, Meni used 2 CDs (about 1 GB?) with 64 GB of memory. So I think the 1 GB SDEWAC 1/30 parsed corpus would take about the same amount of memory ... hmm. Should I try a smaller corpus? Are there any parameters suited to an automatically parsed corpus (which may contain many parse errors)? --- What I hope is to cover more of the corpus, with a higher threshold or something like that. Any advice would be nice!
