Min-counts #20

Open
danpovey opened this issue Jun 25, 2016 · 4 comments

Comments

@danpovey
Owner

I'm adding a note here, although this is not really an 'issue' in the normal sense.

I just checked in code that supports enforcing min-counts. This should make the process of building and pruning LMs about twice as fast without much affecting perplexity results.
@chris920820 and @keli78, can you please test this?
It's done at the get_counts.sh stage (you should now use get_counts.py, which supports the --min-counts option). Here are some experiments to do, e.g. on the Switchboard+Fisher setup:
Try this for two settings: (a) min-count=2 for both sources; (b) min-count=2 for Fisher and 1 for Switchboard.
[These min-counts will be applied to orders 3 and higher.]

  • See how much faster the LM estimation is than before, and check that the process of getting the counts does not become too slow (increase the --num-jobs option to get_counts.py if it does).
  • See how the min-counts affect the curve of LM size versus perplexity on dev data as you prune with various thresholds.
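To illustrate the idea of order-dependent min-counts, here is a toy sketch (not pocolm's actual implementation; the function name and data layout are invented for illustration): n-grams of order 3 and higher whose count falls below the threshold are discarded, while unigrams and bigrams are always kept.

```python
def apply_min_counts(ngram_counts, min_count, min_count_order=3):
    """Drop n-grams of order >= min_count_order whose count is below
    min_count; lower-order n-grams are kept unconditionally.

    ngram_counts: dict mapping an n-gram (tuple of words) -> count.
    """
    pruned = {}
    for ngram, count in ngram_counts.items():
        order = len(ngram)
        if order >= min_count_order and count < min_count:
            continue  # discard this rare high-order n-gram
        pruned[ngram] = count
    return pruned

counts = {
    ("the",): 10,
    ("the", "cat"): 1,         # bigram: kept even though rare
    ("the", "cat", "sat"): 1,  # trigram below min-count=2: dropped
    ("a", "b", "c"): 3,        # trigram at/above min-count: kept
}
pruned = apply_min_counts(counts, min_count=2)
```

With per-source min-counts (e.g. Fisher=2, Switchboard=1) the same filter would simply be run with a different threshold on each source's counts before they are combined.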

Don't bother testing decoding using these LMs versus no-mincount ones for different pruning thresholds, as the differences will likely be too small to measure. But you could do an experiment where you do rescoring with the full no-mincount vs with-mincount LMs, and see if the WER is affected [which is unlikely].

You may discover some bugs as you do this.
@chris920820, you could perhaps make a pull request where you replace instances of get_counts.sh with get_counts.py -- I know you already did this, but that pull request is now out of date. Let's wait a bit before removing the old script get_counts.sh.

Dan

@vince62s
Contributor

Dan,
I am sure you applied the min-counts to orders 3 and above to replicate the SRILM behavior, but I really think that pruning the lower orders too, i.e. unigrams and bigrams, could be helpful. It does not make sense to keep typos and the like in the LM. For your information, Ken Heafield made the change and KenLM now supports unigram pruning.

@danpovey
Owner Author

If you want to prune unigrams, that's something that can be done while preparing the word-list. The reason why I disallowed pruning bigram counts is that it would have required changes elsewhere in the toolkit [and anyway they can be removed as part of the entropy-pruning operation later on].
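Pruning unigrams at word-list preparation time could look something like the following sketch (a hypothetical helper, not part of pocolm): words below a frequency cutoff are simply excluded from the vocabulary, so that they are later mapped to the unknown-word symbol.

```python
from collections import Counter

def build_word_list(corpus_lines, min_count=2):
    """Return a sorted vocabulary containing only words that occur at
    least min_count times; everything else would later be mapped to
    an unknown-word symbol such as <unk>."""
    freq = Counter()
    for line in corpus_lines:
        freq.update(line.split())
    return sorted(w for w, c in freq.items() if c >= min_count)

corpus = ["the cat sat", "the dog sat", "teh cat"]  # 'teh' is a typo
words = build_word_list(corpus)
# singletons like the typo 'teh' (and 'dog') fall out of the vocabulary
```

This achieves the effect vince62s describes for unigrams without touching the count-collection code: typos and other one-off tokens never enter the LM's vocabulary in the first place.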

Dan


@vince62s
Contributor

Well, my comment was about unigrams and bigrams ... anyway, this can be done differently.

@vince62s
Contributor

vince62s commented Jul 1, 2016

After last night's fix it's running fine now. I am running some perplexity and LM-size comparisons right now across various configurations.
