Home
This wiki provides instructions for reproducing the results in the paper, plus some additional output.
Go to the top-level directory `topicmodel-eval` and run:

$ ./build compile
This may take a while as dependencies are downloaded and the code is compiled. If this succeeds, you are ready to move on.
Note: Unless indicated otherwise, the commands given below (and in sub-pages) assume you are in the top-level directory `topicmodel-eval`.
We give instructions for running the code to produce the data and plots in the paper. Keep in mind that the algorithms involve randomness, so the precise numbers you get won't match ours exactly; the paper is about the broad patterns observed with likelihoods, not specific values.
- [Small simulation experiments](wiki/Reproducing-Small-Simulation-Experiments)
- Larger simulation experiments
- Real corpora experiments
To get some measurements for the different corpora (Table 1 in the paper), run the following:
$ bin/tmeval corpus-stats
You'll see information pass by for each corpus, followed at the end by the following output:
pcl-travel,188765,4780051,469,367.51598604686575
sgu,26851,421621,472,678.2764922949932
20news,114547,2743124,145,353.43882073139616
reuters,43153,1528617,70,47.916594202843754
gutenberg,78556,2953834,377,55.51576352712804
nyt,182942,21836689,405,209.63539777432626
This provides: dataset name, vocabulary size, number of (non-stopword) tokens, average document length, and standard deviation of the document length.
Though this paper focuses on evaluating the predictive likelihood of topic models and doesn't consider the topics themselves, it is of course usually interesting to see them. We've computed topics for all six datasets and posted them.
If you'd like to compute them yourself (so that you can play around with different numbers of topics, for all datasets or a specific one), use the `output-topics` command. For example, do the following to get the topics for all datasets (using the default 100 topics):
$ bin/tmeval output-topics -o all-topics.txt
The output goes to the file `all-topics.txt`. (It will take a while, so be patient.)
The following computes 25 topics for the SGU transcripts data, outputting them to `sgu-topics.txt`:
$ bin/tmeval output-topics -d sgu -n 25 -o sgu-topics.txt
Note: the topics you see are computed from both the train and eval portions of each corpus.
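If you'd like a separate topics file per dataset, a simple shell loop over the dataset names from the `corpus-stats` output works, using only the documented `-d`/`-n`/`-o` flags (this loop is just a convenience sketch, not a `tmeval` command):

```shell
# Compute 25 topics for each corpus, writing pcl-travel-topics.txt,
# sgu-topics.txt, and so on. Expect this to take a while per dataset.
for d in pcl-travel sgu 20news reuters gutenberg nyt; do
  bin/tmeval output-topics -d "$d" -n 25 -o "$d-topics.txt"
done
```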