Min-counts #20
Comments
Dan,
If you want to prune unigrams, that's something that can be done while
Dan
On Tue, Jun 28, 2016 at 7:32 AM, vince62s [email protected] wrote:
Well, my comment was for unigrams and bigrams ... anyway, this can be done differently.
After last night's fix it's running fine now. I am measuring perplexity and LM size right now to compare various situations.
I'm adding a note here, although this is not really an 'issue' in the normal sense.
I just checked in code that supports enforcing min-counts. This should make the process of building and pruning LMs about twice as fast without affecting perplexity results much.
@chris920820 and @keli78, can you please test this?
It's done at the stage of get_counts.sh (you should now use get_counts.py, which supports the --min-counts option). Some experiments to do, e.g. on the Switchboard+Fisher setup, are as follows:
Try this for two settings: (a) min-counts=2, (b) min-count for fisher=2, swbd=1
[these min-counts will be applied for orders 3 and higher].
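To make the two settings concrete, here is a minimal Python sketch of what a per-source min-count on orders 3 and higher means. It is a toy illustration, not the code in get_counts.py: the function names, the toy corpora, and the choice to simply drop low-count n-grams are all assumptions made for the example, and the real tool may treat the removed counts differently (for instance by backing them off to a lower order).

```python
from collections import Counter


def ngram_counts(tokens, max_order):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def apply_min_count(counts, min_count, min_order=3):
    """Keep an n-gram if its order is below min_order or its count >= min_count.

    Hypothetical helper: it discards low-count n-grams outright, which is a
    simplification of whatever the real script does with them.
    """
    return Counter({
        ngram: c for ngram, c in counts.items()
        if len(ngram) < min_order or c >= min_count
    })


# Toy stand-ins for the two data sources; the text is made up.
corpora = {
    "fisher": "a b c a b c a b d".split(),
    "swbd": "a b d e".split(),
}

# Setting (b) from above: min-count 2 for fisher, 1 for swbd.
min_counts = {"fisher": 2, "swbd": 1}

pruned = {
    name: apply_min_count(ngram_counts(tokens, 3), min_counts[name])
    for name, tokens in corpora.items()
}
```

In this toy data the trigram `a b d` occurs once in each source, so it is removed from the fisher counts and kept in the swbd counts, while unigrams and bigrams are untouched in both.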
Don't bother testing decoding using these LMs versus no-mincount ones for different pruning thresholds, as the differences will likely be too small to measure. But you could do an experiment where you do rescoring with the full no-mincount vs with-mincount LMs, and see if the WER is affected [which is unlikely].
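For the rescoring comparison, WER is the word-level edit distance between reference and hypothesis, summed over utterances and divided by the number of reference words. The sketch below shows that calculation in plain Python so the two LMs can be compared on the same references; the function names and the example sentences are made up, and a real setup would normally use its existing scoring tools instead.

```python
def word_errors(ref, hyp):
    """Return the word-level edit distance between two whitespace-split strings."""
    r, h = ref.split(), hyp.split()
    # prev[j] is the distance between the ref words seen so far and h[:j].
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,               # deletion of a reference word
                cur[j - 1] + 1,            # insertion of a hypothesis word
                prev[j - 1] + (rw != hw),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(pairs):
    """WER in percent over (reference, hypothesis) pairs."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("no reference words")
    return 100.0 * errors / words


# Hypothetical outputs from rescoring with the two LMs.
refs = ["the cat sat on the mat", "a b c"]
hyps_no_mincount = ["the cat sat on the mat", "a x c"]
hyps_mincount = ["the cat sat on mat", "a x c"]

wer_no_mincount = wer(list(zip(refs, hyps_no_mincount)))
wer_mincount = wer(list(zip(refs, hyps_mincount)))
```

Scoring both hypothesis sets against the same references keeps the denominator identical, so any difference in WER comes only from the change in LM.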
You may discover some bugs as you do this.
@chris920820, you could perhaps make a pull request where you replace instances of get_counts.sh with get_counts.py. I know you already did this, but that pull request is now out of date. Let's wait a bit before removing the old script get_counts.sh.
Dan