Reproducing Real Corpora Experiments

Five of the datasets discussed in the paper are included in their original format in the directory data/orig. These are:

$ ls data/orig 
20news.tar.bz2  gutenberg.tar.bz2  pcl-travel.tar.bz2  reuters21578.tar.bz2  sgu-2013-04-04.tar.bz2

To run the experiments, each of these datasets needs to be processed into the input format expected for training and evaluating models. To generate the processed versions, run:

$ bin/tmeval prepare

Note: this may take some time (possibly hours), depending on your machine. The main source of the slowness is that the bzipped files are uncompressed file-by-file in the code rather than decompressed directly with bunzip2. We set it up this way so that you can just run the above command, without a series of unzipping, moving, and preparation commands for each corpus. (It also means there is less dependence on the software you have on your machine.)
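
If you just want to inspect one of the raw corpora, a plain tar invocation unpacks an archive directly; this is not part of the tmeval workflow (prepare handles extraction itself), and the destination directory here is only an example:

$ tar xjf data/orig/sgu-2013-04-04.tar.bz2 -C /tmp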

The New York Times corpus is not packaged with this repository because it requires a license for the English Gigaword corpus. If you have that corpus, you can process it for model training and evaluation by running:

$ bin/tmeval prepare <path-to-directory-containing-English-Gigaword>/english-gigaword-LDC2003T05/cdrom0/nyt

These steps produce directories in data/extracted, each ready with train and eval portions. Note that for some of the preparations, books or other long texts are cut into sub-parts; each sub-part is then a "document" for LDA to use in computing the topics.
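
The prepare step does this cutting internally; conceptually it is similar to splitting a long text into fixed-size pieces with the standard split tool (the file name and chunk size here are made up for illustration):

$ split -l 200 some-long-book.txt some-long-book-part-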

To replicate the numbers from the paper, run the following:

$ bin/tmeval corpus-exp -o corpus-experiment.csv

This runs on all datasets in the data/extracted directory, with the default options used in the paper. Note that it will likely take over a day to complete, since it computes and evaluates 10 topic models for each of five or six corpora (six if you have the New York Times data).

There are options that let you choose a different number of topics, use a different number of draws from the posterior, run on just a single dataset, and write the results to a specified file. For example, to use 20 topics and 3 posterior draws for the SGU corpus, do this:

$ bin/tmeval corpus-exp -n 20 -r 3 -d sgu -o sgu.csv

This is also useful if you want to launch six separate jobs, one per corpus, to speed things up overall; see the sketch below.
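
For example, a loop like the following launches one background job per corpus. The dataset names are taken from the output rows further down; check --help for the exact values that -d accepts:

$ for d in pcl-travel sgu 20news reuters gutenberg nyt; do
    bin/tmeval corpus-exp -d $d -o $d.csv &
  done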

Use --help with the corpus-exp target to see more details about the options.
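
That is:

$ bin/tmeval corpus-exp --help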

Here's the output from running on all corpora; this is the data behind the plots in Figure 3 in Section 4.3. Each row lists the corpus, the evaluation method, and then the resulting values.

pcl-travel,Kalman,-1.4618561663607458E7,-1.4655529237853384E7,-1.4613439085180148E7,-1.459083459031899E7,-1.462129958631983E7,-1.4630333928821385E7,-1.4632542539239062E7,-1.4590249838038314E7,-1.4647191639698178E7,-1.4598685042150786E7,-1.46055111484545E7
pcl-travel,L2R(1),-1.4972755176952055E7,-1.5003004660387432E7,-1.497094602243937E7,-1.4943523409431787E7,-1.4980746062332787E7,-1.4988293637200933E7,-1.4998950115169896E7,-1.494375853559233E7,-1.5009015921355153E7,-1.4941746916213343E7,-1.4947566489397533E7
pcl-travel,L2R(50),-1.4671558891054083E7,-1.4706172340110762E7,-1.467211168241141E7,-1.4642350674787898E7,-1.467664508002009E7,-1.4686415304475706E7,-1.4693588532000408E7,-1.4638850779249396E7,-1.4693840676390497E7,-1.4653579856383191E7,-1.465203398471146E7
pcl-travel,Mallet-L2R(50),-1.4646457878588548E7,-1.4682543063722625E7,-1.4640504968196223E7,-1.461791558831313E7,-1.4649337787827699E7,-1.4658045162837777E7,-1.4665828532792505E7,-1.4621569107532943E7,-1.4673796614094045E7,-1.4624230309410432E7,-1.4630807651158083E7
sgu,Kalman,-1061495.450053688,-1061390.9815716736,-1061256.951134697,-1061158.8323915165,-1061822.2016151936,-1062192.0555095344,-1061790.6006813766,-1061027.1180808346,-1061560.0670532694,-1061476.7012907083,-1061278.9912080748
sgu,L2R(1),-1085169.7668667785,-1085689.4091352455,-1085407.648772532,-1085597.4799047527,-1084719.1251613293,-1086562.8346018703,-1085124.1749676894,-1084870.3663246473,-1083261.6703627326,-1085344.9169802337,-1085120.042456754
sgu,L2R(50),-1065363.5689774633,-1065275.0588298242,-1065537.7535835262,-1065331.75906835,-1066163.5360348742,-1065336.4405339893,-1065527.370861277,-1065155.4537080138,-1064934.763578311,-1065512.0034665852,-1064861.5501098796
sgu,Mallet-L2R(50),-1063921.2719623037,-1063721.0250299496,-1063899.8996610497,-1063806.4904763026,-1064247.1481723941,-1064607.8134383373,-1064040.0196185163,-1063445.3042853465,-1063652.017283178,-1064064.2274841557,-1063728.7741738071
20news,Kalman,-8956183.352329815,-8955804.493440285,-8957530.668187175,-8944356.000909923,-8960815.156925673,-8967395.20309686,-8956616.975285193,-8955638.130871536,-8952928.745149087,-8953584.103764819,-8957164.045667607
20news,L2R(1),-9210585.467911635,-9217366.49164696,-9207365.505841445,-9193956.581807185,-9225511.926334588,-9219125.67893603,-9211744.847231122,-9194686.88218967,-9213767.220140513,-9210433.882546842,-9211895.662441997
20news,L2R(50),-9006598.265182773,-9004713.688728262,-9007971.117539916,-8994958.296959596,-9013139.488399444,-9014067.022757573,-9008504.70750101,-9002364.997603703,-9005157.759303909,-9006380.599426426,-9008724.973607874
20news,Mallet-L2R(50),-8967843.843802076,-8968165.00706046,-8969439.249318898,-8955512.197824324,-8972208.266797889,-8978108.214470858,-8968996.371190084,-8967355.74417477,-8963948.11095668,-8965952.14754158,-8968753.128685206
reuters,Kalman,-2300240.208790208,-2299363.3111717836,-2301912.713656295,-2301133.3474815497,-2300552.7281025695,-2298990.4528278676,-2301416.621595917,-2299445.083458451,-2298772.9805254256,-2301377.7351636845,-2299437.11391854
reuters,L2R(1),-2426736.996555877,-2428581.0346570197,-2427086.525568232,-2425408.5836348026,-2428073.272884081,-2426969.1629735795,-2426580.6620018985,-2420738.5941816606,-2425851.443834125,-2428508.277252385,-2429572.4085709867
reuters,L2R(50),-2332108.2473099777,-2332356.8793088966,-2335715.334419686,-2331678.8889926844,-2332613.065789818,-2332409.302896822,-2332712.905273661,-2329338.6131877797,-2330504.2391536543,-2331233.523183939,-2332519.7208928354
reuters,Mallet-L2R(50),-2307310.983873366,-2306714.446382148,-2308511.4162255256,-2308051.334759716,-2307347.1985815093,-2306664.3310794733,-2308713.3402229534,-2306429.7291149804,-2305493.2669441677,-2308213.1043167943,-2306971.671106394
gutenberg,Kalman,-1.0857335197234202E7,-1.0855190844132615E7,-1.0854938662117872E7,-1.0858204331732133E7,-1.0854295685219957E7,-1.085961873972983E7,-1.0860960664981332E7,-1.0857239858059268E7,-1.0858181126006728E7,-1.085858965955041E7,-1.0856132400811866E7
gutenberg,L2R(1),-1.1065307631149977E7,-1.1069039686560461E7,-1.1065420136066161E7,-1.1068964842089102E7,-1.1054692100290032E7,-1.1061561654385775E7,-1.1071130572707623E7,-1.106159294668906E7,-1.1066291013787419E7,-1.1066256842457611E7,-1.1068126516466537E7
gutenberg,L2R(50),-1.088465846476173E7,-1.0886715490567436E7,-1.0885816670000456E7,-1.0888043629577829E7,-1.0874248322110191E7,-1.0881154792793758E7,-1.088553581772107E7,-1.0882450832237897E7,-1.0890069286810962E7,-1.0884199586912775E7,-1.088835021888492E7
gutenberg,Mallet-L2R(50),-1.0871246300596025E7,-1.0870188815797197E7,-1.0869601181744248E7,-1.0871752639335139E7,-1.0867679976736244E7,-1.0873208647346107E7,-1.0874000961159017E7,-1.0871818644611264E7,-1.0872773023884334E7,-1.0870709189760495E7,-1.0870729925586201E7
nyt,Kalman,-3.7541113731890954E7,-3.754556549807023E7,-3.752445769580288E7,-3.753818681729596E7,-3.756372162043704E7,-3.75418034681634E7,-3.754721481922773E7,-3.752284906405187E7,-3.753676602359335E7,-3.7552270579320796E7,-3.753830173294626E7
nyt,L2R(1),-3.8566750020599045E7,-3.85857332670064E7,-3.854579240414159E7,-3.857458833227764E7,-3.85598243142823E7,-3.856090042442388E7,-3.85858210030933E7,-3.85726353554817E7,-3.853195088426501E7,-3.858447796424011E7,-3.856577625677846E7
nyt,L2R(50),-3.7742441965639256E7,-3.776095781245066E7,-3.772257692710528E7,-3.7740951478248805E7,-3.775906477535271E7,-3.773532263142288E7,-3.774909467812409E7,-3.773617352995422E7,-3.773082871364599E7,-3.775324264849756E7,-3.7736206461590335E7
nyt,Mallet-L2R(50),-3.7648492040135354E7,-3.765872471130659E7,-3.762918650339382E7,-3.764389849933109E7,-3.766948981458876E7,-3.765152966130101E7,-3.765438297758812E7,-3.763111966094828E7,-3.764272683250118E7,-3.76602069659566E7,-3.764365477443809E7

TBA: how to turn the above data into Figure 3 using R.
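
In the meantime, here is a quick way to eyeball per-method averages for a single corpus, assuming you have saved the rows above as corpus-experiment.csv; the corpus pattern and the choice to average all values in a row are just for illustration:

$ grep '^sgu,' corpus-experiment.csv | awk -F, '{ s=0; for (i=3; i<=NF; i++) s+=$i; print $2, s/(NF-2) }'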
