Reproducing Real Corpora Experiments
Five of the datasets discussed in the paper are included in their original format in the directory data/orig. These are:
$ ls data/orig
20news.tar.bz2 gutenberg.tar.bz2 pcl-travel.tar.bz2 reuters21578.tar.bz2 sgu-2013-04-04.tar.bz2
To run the experiments, each of these datasets needs to be processed into the input format expected for training and evaluating models. To generate the processed versions, simply run:
$ bin/tmeval prepare
Note: this may take some time (possibly hours), depending on your machine. The main source of the slowness is that the bzipped files are uncompressed file-by-file in the code rather than decompressed directly with bunzip2. We set it up this way so that you can just run the above command instead of a series of unzipping, moving, and processing commands for each corpus. (It also reduces the dependence on the software installed on your machine.)
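Since preparation can run for hours, it can be convenient to run it in the background and log its output. A minimal sketch, assuming you want the log in prepare.log (the file name is just an illustration):
$ nohup bin/tmeval prepare > prepare.log 2>&1 &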
The New York Times corpus is not packaged with this repository, as it requires a license for the English Gigaword corpus. If you have that corpus, you can process it for model training and evaluation by running:
$ bin/tmeval prepare <path-to-directory-containing-English-Gigaword>/english-gigaword-LDC2003T05/cdrom0/nyt
These steps will produce directories in data/extracted that are now ready with train and eval portions. Note that for some of the preparations, books or longer texts are cut into sub-parts; each of these is then a "document" for LDA to use in computing the topics.
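To confirm the preparation succeeded, you can list the produced directories (the exact names depend on the preparation code; expect one entry per corpus):
$ ls data/extracted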
To replicate the numbers from the paper, run the following:
$ bin/tmeval corpus-exp -o corpus-experiment.csv
This runs on all datasets in the data/extracted directory, with the default options used in the paper. Note that this will likely take over a day to complete, since it is computing and evaluating 10 topic models for each of five to six corpora (six if you have the New York Times data).
There are options to choose a different number of topics, use a different number of draws from the posterior, run on just a single dataset, and write the results to a specified file. For example, to use 20 topics and 3 posterior draws for the SGU corpus, do this:
$ bin/tmeval corpus-exp -n 20 -r 3 -d sgu -o sgu.csv
This is also useful if you want to launch six separate jobs, one for each corpus, to speed things up a bit overall.
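For example, here is a minimal sketch of launching one background job per corpus, assuming the dataset identifiers match those in the output below (drop nyt if you do not have the Gigaword data):
$ for d in pcl-travel sgu 20news reuters gutenberg nyt; do bin/tmeval corpus-exp -d $d -o $d.csv & done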
Use --help with the corpus-exp target to see more details about the options.
Here's the output from running on all of the corpora, used to generate the plots in Figure 3, Section 4.3.
pcl-travel,Kalman,-1.4618561663607458E7,-1.4655529237853384E7,-1.4613439085180148E7,-1.459083459031899E7,-1.462129958631983E7,-1.4630333928821385E7,-1.4632542539239062E7,-1.4590249838038314E7,-1.4647191639698178E7,-1.4598685042150786E7,-1.46055111484545E7
pcl-travel,L2R(1),-1.4972755176952055E7,-1.5003004660387432E7,-1.497094602243937E7,-1.4943523409431787E7,-1.4980746062332787E7,-1.4988293637200933E7,-1.4998950115169896E7,-1.494375853559233E7,-1.5009015921355153E7,-1.4941746916213343E7,-1.4947566489397533E7
pcl-travel,L2R(50),-1.4671558891054083E7,-1.4706172340110762E7,-1.467211168241141E7,-1.4642350674787898E7,-1.467664508002009E7,-1.4686415304475706E7,-1.4693588532000408E7,-1.4638850779249396E7,-1.4693840676390497E7,-1.4653579856383191E7,-1.465203398471146E7
pcl-travel,Mallet-L2R(50),-1.4646457878588548E7,-1.4682543063722625E7,-1.4640504968196223E7,-1.461791558831313E7,-1.4649337787827699E7,-1.4658045162837777E7,-1.4665828532792505E7,-1.4621569107532943E7,-1.4673796614094045E7,-1.4624230309410432E7,-1.4630807651158083E7
sgu,Kalman,-1061495.450053688,-1061390.9815716736,-1061256.951134697,-1061158.8323915165,-1061822.2016151936,-1062192.0555095344,-1061790.6006813766,-1061027.1180808346,-1061560.0670532694,-1061476.7012907083,-1061278.9912080748
sgu,L2R(1),-1085169.7668667785,-1085689.4091352455,-1085407.648772532,-1085597.4799047527,-1084719.1251613293,-1086562.8346018703,-1085124.1749676894,-1084870.3663246473,-1083261.6703627326,-1085344.9169802337,-1085120.042456754
sgu,L2R(50),-1065363.5689774633,-1065275.0588298242,-1065537.7535835262,-1065331.75906835,-1066163.5360348742,-1065336.4405339893,-1065527.370861277,-1065155.4537080138,-1064934.763578311,-1065512.0034665852,-1064861.5501098796
sgu,Mallet-L2R(50),-1063921.2719623037,-1063721.0250299496,-1063899.8996610497,-1063806.4904763026,-1064247.1481723941,-1064607.8134383373,-1064040.0196185163,-1063445.3042853465,-1063652.017283178,-1064064.2274841557,-1063728.7741738071
20news,Kalman,-8956183.352329815,-8955804.493440285,-8957530.668187175,-8944356.000909923,-8960815.156925673,-8967395.20309686,-8956616.975285193,-8955638.130871536,-8952928.745149087,-8953584.103764819,-8957164.045667607
20news,L2R(1),-9210585.467911635,-9217366.49164696,-9207365.505841445,-9193956.581807185,-9225511.926334588,-9219125.67893603,-9211744.847231122,-9194686.88218967,-9213767.220140513,-9210433.882546842,-9211895.662441997
20news,L2R(50),-9006598.265182773,-9004713.688728262,-9007971.117539916,-8994958.296959596,-9013139.488399444,-9014067.022757573,-9008504.70750101,-9002364.997603703,-9005157.759303909,-9006380.599426426,-9008724.973607874
20news,Mallet-L2R(50),-8967843.843802076,-8968165.00706046,-8969439.249318898,-8955512.197824324,-8972208.266797889,-8978108.214470858,-8968996.371190084,-8967355.74417477,-8963948.11095668,-8965952.14754158,-8968753.128685206
reuters,Kalman,-2300240.208790208,-2299363.3111717836,-2301912.713656295,-2301133.3474815497,-2300552.7281025695,-2298990.4528278676,-2301416.621595917,-2299445.083458451,-2298772.9805254256,-2301377.7351636845,-2299437.11391854
reuters,L2R(1),-2426736.996555877,-2428581.0346570197,-2427086.525568232,-2425408.5836348026,-2428073.272884081,-2426969.1629735795,-2426580.6620018985,-2420738.5941816606,-2425851.443834125,-2428508.277252385,-2429572.4085709867
reuters,L2R(50),-2332108.2473099777,-2332356.8793088966,-2335715.334419686,-2331678.8889926844,-2332613.065789818,-2332409.302896822,-2332712.905273661,-2329338.6131877797,-2330504.2391536543,-2331233.523183939,-2332519.7208928354
reuters,Mallet-L2R(50),-2307310.983873366,-2306714.446382148,-2308511.4162255256,-2308051.334759716,-2307347.1985815093,-2306664.3310794733,-2308713.3402229534,-2306429.7291149804,-2305493.2669441677,-2308213.1043167943,-2306971.671106394
gutenberg,Kalman,-1.0857335197234202E7,-1.0855190844132615E7,-1.0854938662117872E7,-1.0858204331732133E7,-1.0854295685219957E7,-1.085961873972983E7,-1.0860960664981332E7,-1.0857239858059268E7,-1.0858181126006728E7,-1.085858965955041E7,-1.0856132400811866E7
gutenberg,L2R(1),-1.1065307631149977E7,-1.1069039686560461E7,-1.1065420136066161E7,-1.1068964842089102E7,-1.1054692100290032E7,-1.1061561654385775E7,-1.1071130572707623E7,-1.106159294668906E7,-1.1066291013787419E7,-1.1066256842457611E7,-1.1068126516466537E7
gutenberg,L2R(50),-1.088465846476173E7,-1.0886715490567436E7,-1.0885816670000456E7,-1.0888043629577829E7,-1.0874248322110191E7,-1.0881154792793758E7,-1.088553581772107E7,-1.0882450832237897E7,-1.0890069286810962E7,-1.0884199586912775E7,-1.088835021888492E7
gutenberg,Mallet-L2R(50),-1.0871246300596025E7,-1.0870188815797197E7,-1.0869601181744248E7,-1.0871752639335139E7,-1.0867679976736244E7,-1.0873208647346107E7,-1.0874000961159017E7,-1.0871818644611264E7,-1.0872773023884334E7,-1.0870709189760495E7,-1.0870729925586201E7
nyt,Kalman,-3.7541113731890954E7,-3.754556549807023E7,-3.752445769580288E7,-3.753818681729596E7,-3.756372162043704E7,-3.75418034681634E7,-3.754721481922773E7,-3.752284906405187E7,-3.753676602359335E7,-3.7552270579320796E7,-3.753830173294626E7
nyt,L2R(1),-3.8566750020599045E7,-3.85857332670064E7,-3.854579240414159E7,-3.857458833227764E7,-3.85598243142823E7,-3.856090042442388E7,-3.85858210030933E7,-3.85726353554817E7,-3.853195088426501E7,-3.858447796424011E7,-3.856577625677846E7
nyt,L2R(50),-3.7742441965639256E7,-3.776095781245066E7,-3.772257692710528E7,-3.7740951478248805E7,-3.775906477535271E7,-3.773532263142288E7,-3.774909467812409E7,-3.773617352995422E7,-3.773082871364599E7,-3.775324264849756E7,-3.7736206461590335E7
nyt,Mallet-L2R(50),-3.7648492040135354E7,-3.765872471130659E7,-3.762918650339382E7,-3.764389849933109E7,-3.766948981458876E7,-3.765152966130101E7,-3.765438297758812E7,-3.763111966094828E7,-3.764272683250118E7,-3.76602069659566E7,-3.764365477443809E7
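Each row lists a corpus, an evaluation method, and a series of scores. If you want a quick per-row summary without opening R, here is a sketch with awk that averages the numeric fields (this treats every field after the method name uniformly; the exact meaning of each column is determined by the experiment code):
$ awk -F, '{ s = 0; for (i = 3; i <= NF; i++) s += $i; printf "%s,%s,%g\n", $1, $2, s/(NF-2) }' corpus-experiment.csv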
This file sits in the results/ directory. To produce the PDF files from the paper, run the following at the command line:
$ R CMD BATCH makeplots.R
NB: this script will also produce the output for the large simulated experiments.
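R CMD BATCH writes the script's console output to makeplots.Rout in the working directory, so if the plots do not appear you can check there for errors:
$ tail makeplots.Rout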