Home
This wiki provides instructions for reproducing the results in the paper, plus some additional output.
Go to the top-level directory `topicmodel-eval` and run:

$ ./build compile
This may take a while as dependencies are downloaded and the code is compiled. If this succeeds, you are ready to move on.
Note: Unless indicated otherwise, the commands given below (and in sub-pages) assume you are in the top-level directory `topicmodel-eval`.
We give instructions for running the code to produce the data and plots in the paper. Keep in mind that the algorithms involve randomness, so the precise numbers you get won't match ours exactly; the paper is about the broad patterns observed with likelihoods, not specific values.
- [Small simulation experiments](wiki/Reproducing-Small-Simulation-Experiments)
- Larger simulation experiments
- Real corpora experiments
To get some measurements for the different corpora (Table 1 in the paper), run the following:
$ bin/tmeval corpus-stats
You'll see information pass by for each corpus, followed at the end by the following output:
pcl-travel,188765,4780051,469,367.51598604686575
sgu,26851,421621,472,678.2764922949932
20news,114547,2743124,145,353.43882073139616
reuters,43153,1528617,70,47.916594202843754
gutenberg,78556,2953834,377,55.51576352712804
nyt,182942,21836689,405,209.63539777432626
This provides: dataset name, vocabulary size, number of (non-stopword) tokens, average document length, and standard deviation of the document length.
Though this paper focuses on evaluating the predictive likelihood of topic models and doesn't consider the topics themselves, it is of course usually interesting to see them. We've computed topics for all six datasets and posted them.
If you'd like to compute them yourself (so that you can play around with different numbers of topics, for all datasets or a specific one), use the `output-topics` command. For example, do the following to get the topics for all datasets (using the default 100 topics):
$ bin/tmeval output-topics -o all-topics.txt
The output goes to the file `all-topics.txt`. (It will take a while, so be patient.)
The following computes 25 topics for the SGU transcripts data, outputting them to `sgu-topics.txt`:
$ bin/tmeval output-topics -d sgu -n 25 -o sgu-topics.txt
Note: the topics you see are computed from both the train and eval portions of each corpus.
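If you'd like a separate topics file per dataset, a simple shell loop over the dataset names from the `corpus-stats` output works, using only the documented `-d`/`-n`/`-o` flags (this loop is just a convenience sketch, not a `tmeval` command):

```shell
# Compute 25 topics for each corpus, writing pcl-travel-topics.txt,
# sgu-topics.txt, and so on. Expect this to take a while per dataset.
for d in pcl-travel sgu 20news reuters gutenberg nyt; do
  bin/tmeval output-topics -d "$d" -n 25 -o "$d-topics.txt"
done
```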