<html>
<head></head>
<body>
<h1 id="topic-modeling-systems-and-interfaces">Topic Modeling Systems and Interfaces</h1>
<p>The 4Humanities “WhatEvery1Says” project conducted a comparative analysis in 2016
of the following topic modeling systems/interfaces. As a result, it chose to
implement Andrew Goldstone’s DFR-browser for its own work.</p>
<p>Report first published on November 26, 2017.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#convisit">ConVisit</a>
</li>
<li><a href="#dfr-browser">DFR-Browser</a>
</li>
<li><a href="#inpho">InPHO Topic Explorer</a>
</li>
<li><a href="#networked-corpus">The Networked Corpus</a>
</li>
<li><a href="#pyldavis">pyLDAvis</a>
</li>
<li><a href="#serendip">Serendip</a>
</li>
<li><a href="#termite">Termite</a>
</li>
<li><a href="#tiara">TIARA</a>
</li>
<li><a href="#tom">TOM</a>
</li>
<li><a href="#tome">TOME</a>
</li>
<li><a href="#topic-browser">The Topic Browser</a>
</li>
<li><a href="#topical-guide">Topical Guide</a>
</li>
<li><a href="#topicnets">TopicNets</a>
</li>
<li><a href="#twic">TWIC</a>
</li>
</ul>
<p>(The following are the materials that the WE1S team researched in advance of
its February 18, 2016, meeting focused on choosing and implementing a system/platform/interface
for the exploration and interpretation of topic models.)</p>
<hr>
<h2 id="convisit"><a>ConVisIT</a></h2>
<ul>
<li><strong>Description</strong>: E. Hoque and Giuseppe Carenini (2015), <a href="http://www.cs.ubc.ca/~carenini/TEACHING/CPSC503-16/READINGS/iui0167-paper-SUBMITTED.pdf">“ConVisIT: Interactive Topic Modeling for Exploring Asynchronous Online Conversations”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>An all-in-one, start-to-finish system that does its own topic modeling of
a corpus.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ol>
<li>Interactive visualization interface designed for topic modeling of asynchronous
conversations on the Internet (email, blog comments, etc.).</li>
<li>Interface shows the overall conversation (left panel in figure)</li>
<li>Interface also shows the actual conversation (right panel in figure)</li>
<li>“Human-in-the-loop” feature to allow humans iteratively to assess the results
of a topic model and tweak it interactively in a sense-making activity–e.g.,
change granularity of topics, merge or split topics, suppress a topic, or
specify that words must (or must not) be in a topic.</li>
<li>Has an algorithm for automatic labeling of topics. (p. 4 of PDF)</li>
<li>Feedback on the interface was assessed through a user study.</li>
</ol>
</li>
<li><strong>Code site</strong>: [unknown]</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Alan</em>: Can it be adapted for articles?</li>
</ul>
</li>
</ul>
<h3 id="screen-shots">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/ConVisIT.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/ConVisIT-th.jpg"
alt="ConVisIT" title="">
</a>
</p>
<hr>
<h2 id="dfr-browser"><a>DFR-Browser</a></h2>
<ul>
<li><strong>Description</strong>: Andrew Goldstone, <a href="http://agoldst.github.io/dfr-browser/">“Dfr-Browser: Take a MALLET to Disciplinary History”</a> (2013)</li>
<li><strong>Demos</strong>: <a href="http://agoldst.github.io/dfr-browser/demo/">Topics in</a> <em><a href="http://agoldst.github.io/dfr-browser/demo/">PMLA</a></em> |
<a
href="http://signsat40.signsjournal.org/topic-model/">Topics in</a> <em><a href="http://signsat40.signsjournal.org/topic-model/">Signs</a></em> | <a href="http://jgoodwin.net/htb/">Hathi Trust Fiction 1920-22</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Out of the box, Goldstone’s DFR-Browser is specialized to take input from
Jstor’s <a href="http://about.jstor.org/service/data-for-research">DFR (Data for Research)</a> service, run it through Mallet using Goldstone’s companion R package,
<a
href="http://github.com/agoldst/dfrtopics">dfrtopics</a>, and then use <em>d3</em> to generate a dynamic visual exploration
interface. This start-to-finish workflow is modularized, however, allowing
for the use of alternative methods for generating the topic models and
formatted data files that DFR-Browser expects:</li>
<li>Instead of using Goldstone’s R package to generate the Mallet topic model
and then create the specially formatted data files for the DFR-Browser,
a user can run Mallet and output the formatted data files entirely on
the command line. (See <a href="https://github.com/agoldst/dfr-browser#preparing-data-files-entirely-on-the-command-line">instructions here</a> in the Github repo; a sketch of such a pipeline appears below.)</li>
<li>There is also a section in the Github repo titled <a href="https://github.com/agoldst/dfr-browser#browser-data-file-specifications">“Browser data file specifications”</a> that gives detailed instructions about the format and nature of the data
files that DFR-Browser expects. (In principle, this should allow topic
model files that were pre-generated in other ways to be converted into
data files for DFR-Browser.)</li>
</ul>
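<p><em>A minimal sketch (ours, not Goldstone’s) of the kind of command-line pipeline described above, scripted in Python via <code>subprocess</code>. The Mallet commands are standard; the final <code>prepare-data</code> step is shown schematically, and the repo’s “Preparing data files entirely on the command line” instructions should be consulted for the exact invocation:</em></p>
<pre class="prettyprint"><code class="language-python">import subprocess

# 1. Import a directory of plain-text articles into Mallet's binary format.
subprocess.check_call([
    "mallet", "import-dir",
    "--input", "corpus_txt/",
    "--output", "corpus.mallet",
    "--keep-sequence",
    "--remove-stopwords",
])

# 2. Train a topic model, keeping the Gibbs sampling state from which
#    DFR-Browser's data files are derived.
subprocess.check_call([
    "mallet", "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "50",
    "--optimize-interval", "10",
    "--output-state", "topic-state.gz",
    "--output-doc-topics", "doc-topics.txt",
])

# 3. Convert the sampling state into the files the browser expects
#    (e.g., tw.json and dt.json.zip). This call is schematic; see the
#    repo instructions for the exact arguments.
subprocess.check_call([
    "python", "bin/prepare-data", "convert-state", "topic-state.gz",
    "--tw", "data/tw.json", "--dt", "data/dt.json.zip",
])
</code></pre>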
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visual
exploration interface with several main views:
<br>
<ul>
<li><em>“Overview”</em> (top figure at right) showing topics as circles laid
out in a regular grid, each circle labeled with the six most important
words in a topic. Clicking on a topic brings the user to →</li>
<li><em>“Topic” view</em> (bottom figure at right) showing a ranked list of words
in the topic with bars representing relative weight (left panel), a timeline
graph of the topic’s weight in the corpus, and a list of articles ranked
by the amount of the topic infused in them.</li>
<li>Clicking on a word in the ranked list of topic words brings the user to →
<em>“Word” view</em>, which shows what other topics the word appears in
(and its relative weight in that topic).</li>
<li>Clicking on a document brings the user to → <em>“Document” view</em>, which
shows a ranked list of other topics in that document (and their relative
weights in the document)</li>
<li>Clicking on a bar in the timeline graph of a topic brings the user to → a
view showing the top documents in that year infused by that topic.</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/agoldst/dfr-browser">GitHub repo</a>.
The following sections of the documentation on Github indicate that we might
be able to generate the topic-modeling and other data files for the DFR-Browser
with the WE1S material:
<br>
<ul>
<li><a href="https://github.com/agoldst/dfr-browser#preparing-data-files-entirely-on-the-command-line">“Preparing data files entirely on the command line”</a>
</li>
<li><a href="https://github.com/agoldst/dfr-browser#adapting-this-project-to-other-kinds-of-documents">“Adapting this project to other kinds of documents”</a>
</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Alan</em>: we should try using the command line as instructed in the
documentation to see if we can create the data files in the format needed
for DFR-Browser</li>
<li><em>Lindsay</em>: I kind of got this to work. With Goldstone’s new instructions
and the prepare-data python script, I was able to figure out how to get
the data into the format the browser needs in the command line. I was also
able to do this using the dfrtopics package in R with a pre-run mallet
file (which I did in the command line). However, I haven’t yet figured
out how to properly configure the browser’s main js file so that it will
all display properly (right now only the overview view works as it should).
Also, I tried this on a very small subset of our corpus (10 articles from
the NYT) because getting the metadata into the right shape so that dfrtopics/the
prepare-data python script can read it properly is still something I don’t
know how to do. I just did it manually, and it worked, but we would have
to create a script that could wrangle metadata for us in order to do this
for a larger number of documents.</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-1">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-1.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-1-th.jpg"
alt="DFR-Browser, multiple topics view" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-2.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-2-th.jpg"
alt="DFR-Browser, single topic view" title="">
</a>
</p>
<hr>
<h2 id="inpho-topic-explorer"><a>InPhO Topic Explorer</a></h2>
<ul>
<li><strong>Description (Demos)</strong>: <a href="http://inphodata.cogs.indiana.edu/">home page</a>
</li>
<li>See also Jaimie Murdock and Colin Allen (2015), <a href="http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/10007/9852">“Visualization Techniques for Topic Model Checking”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>An all-in-one, start-to-finish system; does its own topic modeling of a corpus.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ul>
<li>Interactive visual exploration interface that shows:</li>
<li>List of all documents in a corpus (file names listed vertically down the
left, with the first line of the document showing to help the user grok
the article). Documents are ranked according to the weight of the topic
that is a user’s “focus” at present. (E.g., if a user is examining Topic
1, then the document where Topic 1 is most prevalent will be at the top.)</li>
<li>Bands of color superimposed over every document filename and first line that
are color-coded to the topics in the topic model (where the legend for
topic colors is at the right)</li>
<li>The relative size of each color band among all the colors in a document indicates
the weight of specific topics in the article.</li>
<li>The cumulative width of the color bands for a document indicates the similarity
of the document to the user’s current “focus” topic (or document).</li>
<li>Clicking on a color anywhere resets the “focus” to the topic corresponding
to that color, with the whole list of articles and color bands shifting
to reorient around that topic (ranked with the articles most expressive
of that topic at the top).</li>
<li>When a topic is selected, clicking the “Top Documents for [Topic]” button
at lower right of the interface “will take you to a new page showing the
most similar documents to that topic’s word distribution.”</li>
<li>There is a search function to identify which documents in a corpus contain
a word.</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/inpho/topic-explorer">GitHub repo</a> (an Anaconda 2.7 Python distribution)</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Alan</em>: This is a package for Python 2.7. I tried to install, but
installation of the underlying VSM module failed. One issue that appeared
in the error messages: “error: Microsoft Visual C++ 9.0 is required (Unable
to find vcvarsall.bat). Get it from <a href="http://aka.ms/vcpython27">http://aka.ms/vcpython27</a>”
I’ve seen this error before when a package installation on a Windows machine
calls on Visual C++ as its compiler, but the particular machine does not
have Visual C++.</li>
<li><em>Scott</em>: The above issue has supposedly been addressed (as of May 25, 2016),
so it might be worth pulling the repo again. The tool has some interesting
features that might help in defining stopword lists.</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-2">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/inpho.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/inpho-th.jpg"
alt="InPhO" title="">
</a>
</p>
<hr>
<h2 id="the-networked-corpus"><a>The Networked Corpus</a></h2>
<ul>
<li><strong>Description</strong>: Jeff Binder and Collin Jennings, <a href="http://www.networkedcorpus.com/">The Networked Corpus</a>
<br>
<ul>
<li>See also article: Jeffrey Binder and Collin Jennings (2014), <a href="http://llc.oxfordjournals.org/content/29/3/405.full">“Visibility and Meaning in Topic Models and 18th-Century Indexes”</a></li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>: Takes input from Mallet.</li>
<li>
<p><strong>Notable interpretive features of interface</strong>:
<br>
</p>
<ul>
<li>Interactive visualization interface that takes input from Mallet. The key
design principle of the interface is to avoid using topic labels (which
can be deceptive) but instead to provide an easy way to identify passages
in documents that are “dense” with a particular topic, allow them to be
compared to other passages also dense with the topic, and thus provide
the user with an understanding of a topic’s meaning built up from intertextual
context.
<br> In particular, the interface:</li>
<li>Shows a document in the left panel and a list of topics (by number)
in the right panel</li>
<li>Choosing a topic number highlights in the document the words that belong
to that topic</li>
<li>A line graphs the topic “density” of passages in the document, with peaks
indicated by asterisks. Clicking on an asterisk calls up a list of links
to other passages (including in other documents) that are dense with that
topic.
<br>
</li>
</ul>
<p></p>
<blockquote>
<p>(The density functions in the interface are calculated using Mallet’s topic-state
file as follows: <em>“the density function is computed using kernel density estimation, which takes into account the words in nearby lines. Using these density functions, the program picks out ‘exemplary passages’ for each topic based on a simple rubric. Passages are only selected if the topic matches at least a certain number of words in the text (default 25), and they are only added if the topic’s maximum density in the text is at least (by default) four times as high as the average density of the whole document. If both of these conditions are met, an asterisk is created at the point of greatest density, with links to every other asterisk that was created for that topic.”</em>)</p>
</blockquote>
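<p><em>As a rough illustration of the rubric quoted above (not the tool’s actual code), the following Python sketch computes a Gaussian kernel density over the line positions of one topic’s words and applies the two thresholds:</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np

def exemplary_peak(positions, n_lines, bandwidth=5.0,
                   min_matches=25, density_ratio=4.0):
    # positions: line numbers at which one topic's words occur in a document
    if len(positions) &lt; min_matches:
        return None  # topic must match at least `min_matches` words
    lines = np.arange(n_lines)
    # Gaussian kernel density estimate over line numbers, so that words
    # on nearby lines contribute to each line's topic "density".
    diffs = lines[:, None] - np.asarray(positions)[None, :]
    density = np.exp(-0.5 * (diffs / bandwidth) ** 2).sum(axis=1)
    # the peak must be several times the document-wide average density
    if density.max() &lt; density_ratio * density.mean():
        return None
    return int(density.argmax())  # the line where an asterisk would go
</code></pre>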
</li>
<li><em>Note</em>: one interesting theoretical tenet of The Networked Corpus is that
topic modeling produces an apparatus for understanding texts and moving around
them non-linearly in a way analogous to earlier “indexing” (and other such apparatus)
in the history of writing and print.</li>
<li><strong>Code site</strong>: <a href="https://github.com/jeffbinder/networkedcorpus">GitHub repo</a>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Scott</em>: <strong><a href="https://github.com/scottkleinman/WE1S/tree/master/networkedcorpus">Instructions for implementing The Networked Corpus</a></strong>;
includes an adapted version of the code files as a <a href="https://raw.githubusercontent.com/scottkleinman/WE1S/master/networkedcorpus/networkedcorpus.zip">zip file</a>;
currently the instructions are for implementation on a Windows machine.</li>
<li><em>Alan’s</em> <strong><a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/page/106519659/Alan%27s%20Instructions%20for%20Implementing%20The%20Networked%20Corpus">step-by-step version of Scott’s instructions</a></strong>,
including a temporary kludge solution for non-ASCII character problems (arrived
at after debugging correspondence with Scott below).</li>
<li><em>Alan</em>: This set of instructions gets The Networked Corpus to run. However,
the results are not as expected and do not match the screenshots seen at
the right. At first, my thought was that the <em>browser.css</em> and <em>index.css</em> files generated by the <em>gen-networked-corpus.py</em> script, together
with the HTML in the html versions of each original text file also generated
by that script, need to be tweaked for today’s browsers. However, after investigation
and experiments, that seems not to be the case.
<br> Instead, the problem seems to lie in a mismatch between our input text format
and that expected by Networked Corpus. Here are the relevant instructions
in the Networked Corpus Github site: <em>“The text files must have hard line breaks at the end of each line. This is used to calculate how far down the page a word occurs, and also affects how wide the text will appear in the browser. If your source documents do not have line breaks and you are on a Mac or Linux system, you can use the ‘fold’ command to wrap them automatically. It doesn’t matter whether the line breaks are Unix or DOS-style. Finally, the first line of each file should be a title; this will be used in the table of contents and in a few other places.”</em>
<br> The plain-text article files in the WE1S corpus have no line breaks. Less
importantly, they are not formatted in a way that makes the first line the
title. (Instead, Networked Corpus ends up treating the entire text as a title,
placing it in the title element in the head of each of the HTML file versions
it generates for an article.)
<br> When using Mallet to create a topic model, the format of the original plain-text
file and the presence or absence of line breaks is irrelevant. However, the
way Networked Corpus seems to work is that it creates an HTML version of each
original plain text file, which when opened in a browser is correlated via
JavaScript to the Mallet data about that file on a token-by-token
basis. The format of the original plain-text files and the presence or absence
of line breaks has an impact on these HTML files and the way they are displayed
in a browser. In particular, Networked Corpus creates in the HTML page for
an article a table of the text in which topic words for a chosen topic are
highlighted. Each “line” of a text is supposed to be a single row, so that
the table extends down the page row-by-row. But if the original plain-text
file has no line breaks, then there is just a single row extending off the
right of the page, nullifying the whole point of Networked Corpus’s document
view of the topic model.
<br> It seems that the next step is to try the “fold” command referred to in the
instructions from Networked Corpus’s Github site above on the WE1S article
files and see what we get. (A Python equivalent of “fold” is sketched below.)</li>
</ul>
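<p><em>A scripted equivalent of the “fold” step, as a sketch under our assumptions about file paths: it hard-wraps each article and prepends the file name as a title line, as the Networked Corpus instructions require:</em></p>
<pre class="prettyprint"><code class="language-python">import textwrap
from pathlib import Path

src, dst = Path("corpus_txt"), Path("corpus_wrapped")  # illustrative paths
dst.mkdir(exist_ok=True)
for f in src.glob("*.txt"):
    text = f.read_text(encoding="utf-8")
    wrapped = textwrap.fill(text, width=80)  # hard line breaks, like `fold`
    # the first line must be a title; here we fall back on the file name
    (dst / f.name).write_text(f.stem + "\n" + wrapped + "\n", encoding="utf-8")
</code></pre>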
</li>
<li>Debugging issues to date leading to above implementation solution:</li>
<li><em>Alan’s error report</em> in response to Scott’s initial instructions for implementing
The Networked Corpus. Implementation produced the following error:</li>
<pre class="prettyprint"><code class="language-python hljs "> ___init__.py<span class="hljs-string">", line 586, in <module>_
_ from ._ufuncs import *_
_ImportError: DLL load failed: The specified module could not be found._</span></code></pre>
<ul>
<li><em>Scott on the DLL load error</em>: “I think the answer might be <a href="http://stackoverflow.com/questions/31596125/python-dll-load-failed">here</a>.
Try running <code>conda update scipy</code> from the command line.”</li>
<li><em>Alan’s error report</em>: “This worked. I’m finally getting gen-networked-corpus.py
to run.
<br> However, I’m now getting a unicode error:</li>
</ul>
<pre class="prettyprint"><code class="language-python hljs "> _File <span class="hljs-string">"C:\Users\Alan\Anaconda\lib\codecs.py"</span>, line <span class="hljs-number">492</span>, <span class="hljs-keyword">in</span> read_
_ newchars, decodedbytes = self.decode(data, self.errors)_
_UnicodeDecodeError: <span class="hljs-string">'utf8'</span> codec can<span class="hljs-string">'t decode byte 0xac in position 0: invalid start byte_
I created the .mallet file for the Mallet topic model using the regex parameter you
suggested: _--token-regex "[\p{L}\p{M}]+"_</span></code></pre>
<pre><code>I'm guessing this is the kind of error that caused you to start debugging the unicode problems in the first place. Let me know if you have any suggestions."
</code></pre>
<ul>
<li><em>Scott’s response</em>: “The line causing the Unicode error is part of a loop
through a directory file list, so it seems to run into problems if the directory
contains something other than the text files you are using to generate your
topic model. This includes the Mallet output. When I set line 303 to a directory
containing only the text files (in this case, one of your early New York Times
collections), I didn’t get the error.</li>
</ul>
<p>Unfortunately, I got another error at the next stage, where the script was getting
hung up at the name “François”. Obviously, we can avoid this problem by stripping
diacritics, but we shouldn’t have to. When I get a chance, I’ll try to figure
it out. But go ahead and try the advice in the previous paragraph, and see if
it works for you.”</p>
<ul>
<li><em>Alan’s error report</em>: “Thanks, Scott. I see. I was misunderstanding what
the “datadir=” in line 303 is supposed to point to: the directory of original
text files and not the directory of Mallet output files for the topic model
of those text files.</li>
</ul>
<p>Unfortunately, after getting that right I am getting another Unicode error that
may be indicating an unexpected character in the plain text (just as you did):</p>
<pre class="prettyprint"><code class="language-python hljs "> _File <span class="hljs-string">"C:\Users\Alan\Anaconda\lib\encodings\cp437.py"</span>, line <span class="hljs-number">12</span>, <span class="hljs-keyword">in</span> encode_
_ <span class="hljs-keyword">return</span> codecs.charmap_encode(input,errors,encoding_map)_
_UnicodeEncodeError: <span class="hljs-string">'charmap'</span> codec can<span class="hljs-string">'t encode character u'</span>\u0301<span class="hljs-string">' in position 72: character maps to <undefined>_</span></code></pre>
<ul>
<li><em>Alan’s temporary kludge solution to the above error</em>: Use “search and
replace” in Notepad++ (set to regex) to delete all non-ASCII characters in
the article files being topic modeled for The Networked Corpus. (<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/page/106519659/Alan%27s%20Instructions%20for%20Implementing%20The%20Networked%20Corpus">See instructions</a>; a scripted version of this kludge is sketched after these notes.)</li>
<li>(<em>Scott’s earlier notes</em>: Some observations: The script must be run from
within the input directory, and the script expects the Mallet output files
to be named as shown in the sample command on GitHub. When I ran it, I encountered
Unicode errors, so I tried a model using <code>--token-regex '[\p{L}\p{M}]+'</code>,
as suggested on the GitHub repo. However, this caused the Mallet train-topics
command to fail. Apparently, in Windows the regular expression <em>must</em> be enclosed in double quotes. The Python script also seems to have substantial
problems with character encoding and/or Windows. I am hacking my way through
it, gradually getting closer to a full implementation, but at the moment I’m
stuck on a particularly confusing block of code. <strong>Update</strong>: I
have never actually managed to get <code>--token-regex</code> to work in Mallet, so the point about double quotes is important independent
of the Networked Corpus tool. As for the tool itself, I have finally managed
to get it to run all the way through. I had to hack the code and inject my
own paths to get it to pull data from the right folders. The result was a little
disappointing, as it produced buggy html/css/javascript (or some combination
of those). The following information is readable. Document Index, Topic Index,
top 10 topics in each document, top 10 documents in each topic. The script
is supposed to choose “exemplary passages” if the topic matches 25 words in
the text and the topic’s maximum density in the text is at least 4 times as
high as the average over the whole document. There did not appear to be any
“exemplary passages”, perhaps because I used Mallet’s tiny sample data set
to build my model. Supposedly, if both of these conditions are met, an asterisk
is created at the point of greatest density, with links to every other asterisk
that was created for that topic. From the images displayed on the website,
this appears to be a visualisation function using protovis.js. Either the javascript
failed or it wasn’t called simply because my data did not produce any exemplary
passages.)</li>
<li><em>Alan</em>: Just to add information that may, or may not, be relevant to Scott’s
original problem with Unicode issues as documented in his note: the WE1S scraping
workflow saves all plain-text files in UTF-8.</li>
</ul>
</ul>
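<p><em>A scripted version of the non-ASCII kludge described in the notes above, as a sketch assuming the WE1S files are UTF-8 (as noted); it transliterates accented characters to their base form (so “François” becomes “Francois”) rather than deleting them outright:</em></p>
<pre class="prettyprint"><code class="language-python">import unicodedata
from pathlib import Path

for f in Path("corpus_wrapped").glob("*.txt"):  # illustrative path
    text = f.read_text(encoding="utf-8")
    # decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    f.write_text(ascii_text, encoding="utf-8")
</code></pre>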
<h3 id="screen-shots-3">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/networked-corpus.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/networked-corpus-th.jpg"
alt="Networked Corpus" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus1.PNG"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus1.PNG"
alt="Networked Corpus" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus2.PNG"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus2.PNG"
alt="Networked Corpus" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus2.PNG">Click for a readable image</a>
</p>
<hr>
<h2 id="pyldavis"><a>pyLDAvis</a></h2>
<ul>
<li><strong>Description</strong>: Ben Mabey and Paul English, <a href="https://github.com/bmabey/pyLDAvis">pyLDAvis</a>
</li>
<li>A Python port of the LDAvis R package.</li>
<li>For a concise explanation of the visualization see this <a href="http://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf">vignette</a> from the LDAvis R package.</li>
<li><strong>Code site</strong>: <a href="https://github.com/bmabey/pyLDAvis">GitHub repo</a></li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Takes input from multiple types of topic models (a minimal usage sketch follows this list).</li>
</ul>
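<p><em>A minimal usage sketch of pyLDAvis’s model-agnostic entry point, with toy arrays standing in for a real model’s output:</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np
import pyLDAvis

# pyLDAvis.prepare() takes raw distributions, so output from Mallet,
# gensim, scikit-learn, etc. can all be fed to it. Toy data below.
K, D, V = 3, 4, 6                                # topics, docs, vocab size
rng = np.random.default_rng(0)
topic_term = rng.dirichlet(np.ones(V), size=K)   # each row sums to 1
doc_topic = rng.dirichlet(np.ones(K), size=D)    # each row sums to 1
doc_lengths = [120, 80, 200, 150]
vocab = ["humanities", "science", "funding", "students", "crisis", "value"]
term_frequency = [40, 35, 25, 30, 10, 15]

vis = pyLDAvis.prepare(topic_term, doc_topic, doc_lengths, vocab, term_frequency)
pyLDAvis.save_html(vis, "ldavis.html")  # writes a self-contained interactive page
</code></pre>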
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ul>
<li>The GitHub site links to numerous examples and demos.</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Not yet reviewed.</em>
</li>
<li>Requires scikit-bio package, which is not yet supported for Windows. Windows
support is scheduled for July 2016. In the meantime, there may be a workaround
<a href="http://stackoverflow.com/questions/27029212/trouble-installing-scikit-bio-on-windows-xp">here</a>,
but it has not yet been tested.</li>
<li>Additionally, scikit-bio is now no longer compatible with Python 2 and thus
would require a separate Python 3 virtual environment (although that’s
relatively easy to do in an Anaconda installation). It may be worth looking
into using the <a href="https://github.com/cpsievert/LDAvis">R package</a>.</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-4">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/file/106758792/pyLDAvis1.png"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/pyLDAvis1.png"
alt="pyLDAvis" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/file/106758792/pyLDAvis2.png"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/pyLDAvis2.png"
alt="pyLDAvis" title="">
</a>
</p>
<hr>
<h2 id="serendip"><a>Serendip</a></h2>
<p><em>Note: Parts of the descriptions and screenshots in the mini-report on Serendip here are excerpted from Scott Kleinman’s <a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/pdf/report-on-serendip.pdf">report-on-serendip.pdf</a>. Other descriptions and one screenshot are based on the Eric Alexander et al. article.</em>
</p>
<ul>
<li><strong>Description</strong>: Eric Alexander, et al. (2014), <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwiti_zIh7TKAhUBTGMKHTi_At0QFggiMAE&url=https%3A%2F%2Fgraphics.cs.wisc.edu%2FPapers%2F2014%2FAKVWG14%2FPreprint.pdf&usg=AFQjCNG-VY5ModzUaOQo8TrvVefKg50a5w&sig2=d-jzuGMxh9yFNjkrud5ghw">“Serendip: Topic Model-Driven Visual Exploration of Text Corpora”</a> (preprint)
<br>
<ul>
<li>See also the <a href="http://vep.cs.wisc.edu/serendip/">Project iPython notebook site</a>.</li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Serendip runs in a Python-Flask environment. It comes with a separate command-line
tool to call Mallet and generate topic models. The Mallet output data is
deposited in a Corpora folder and can then be accessed by the Serendip
interface. In addition to implementing Mallet, the command-line tool generates
multiple files used by the interface to navigate and manipulate the data.
Therefore, Serendip will not work if independently generated Mallet output
files are deposited in the Corpora folder. It is possible that the script
could be modified to read independently generated Mallet data, but this
would require some hacking of the Python script.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ul>
<li>Serendip is designed to give a view of a topic model at many scales, and
with much connection between views. There are three main views (see 1st
figure at right):</li>
<li><em>CorpusViewer</em>: “At the corpus level, we provide a reorderable matrix
to highlight adjacencies between documents and topics.” This view shows
“a re-orderable matrix that connects documents to topics. To address the
“many documents” and “many topics” issues of scale, the matrix supports
filtering and selection, aggregation, and ordering.” “We provide a query-system
that allows users to pick out documents and topics based on their metadata.
Once selected, these sets can be hand-tuned, colored, moved to a more prominent
position in the matrix (typically the top-left corner), used as a basis
for reordering the matrix … or saved to be explored later.”</li>
<li><em>TextViewer</em>: “At the document level, we use tagged text and overview
displays to help readers find and analyze passages in large documents.”
This view shows a tagged text visualization of “the topics and the text.
To support long documents, a summary graph shows how the topics occur over
the length of the document.”</li>
<li><em>RankViewer</em>: “Finally, at the level of individual words—a level we
only observed the need for after watching users interact with our text
level tool—we introduce a ranking visualization that shows how words are
distributed across the topic.” This view “allows users to examine specific
words and see which topics use them. This tool is useful for relating topics
and words, and comparing different topics and words. It can provide topics
(and orderings of topics) to explore more closely in other views.”</li>
<li>Serendip provides three metrics for ranking relationships between topics
and documents:</li>
<li>Frequency (the percentage of a given topic accounted for by each word) – biased
towards words appearing in many topics</li>
<li>Information Gain (the information words gain towards identifying a given
topic) – biased towards rare words that best distinguish topics</li>
<li>Saliency (frequency multiplied by information gain) – finds salient words
across an entire model, not just within a topic. Saliency is the default
ranking metric. (A toy computation of these three metrics is sketched after this list.)</li>
</ul>
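<p><em>A toy computation of the three metrics, paraphrased from the descriptions above rather than taken from Serendip’s code; information gain is treated here as the KL divergence of P(topic|word) from P(topic):</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np

# topic-term count matrix: rows = topics, columns = words
counts = np.array([[10.0, 2.0, 0.0, 8.0],
                   [1.0, 12.0, 5.0, 2.0],
                   [0.0, 3.0, 9.0, 6.0]])

p_w = counts.sum(axis=0) / counts.sum()      # P(word) across the model
p_t = counts.sum(axis=1) / counts.sum()      # P(topic) across the model
p_t_given_w = counts / counts.sum(axis=0)    # P(topic | word)

frequency = counts / counts.sum(axis=1, keepdims=True)  # P(word | topic)
# information gain of each word for identifying topics
info_gain = (p_t_given_w *
             np.log2(p_t_given_w / p_t[:, None] + 1e-12)).sum(axis=0)
saliency = p_w * info_gain  # "frequency multiplied by information gain"
</code></pre>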
</li>
<li><strong>Code site</strong>: <a href="http://vep.cs.wisc.edu/serendip/">Project iPython notebook site with download and use instructions</a>.
(Python 2.7)</li>
<li><em>Scott Kleinman’s</em> <a href="https://github.com/whatevery1says/dev_resources/raw/master/report-on-topic-modeling-interfaces/assets/report-on-serendip.pdf">report-on-serendip.pdf</a>
</li>
<li><em>Scott Kleinman’s</em> <a href="https://github.com/scottkleinman/WE1S/tree/master/serendip">Instructions for implementing Serendip</a>
</li>
</ul>
<h2 id="screen-shots-5">Screen Shots</h2>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-th.jpg"
alt="Serendip - three main views" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-3-aggregated-data.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-3-aggregated-data-th.jpg"
alt="Serendip - aggregated data" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-4-term-distribution-and-metadata-th.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-4-term-distribution-and-metadata-th.jpg"
alt="Serendip - term distibution & metadata views" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-5-topic-words-in-text.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-5-topic-words-in-text-th.jpg"
alt="Serendip - topic words in text view" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-6-rank-viewer.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-6-rank-viewer-th.jpg"
alt="Serendip - rank viewer" title="">
</a>
</p>
<hr>
<h2 id="termite"><a>Termite</a></h2>
<ul>
<li><strong>Description</strong>: <a href="http://vis.stanford.edu/papers/termite">home page</a>
<br>
<ul>
<li>See also: Jason Chuang, Christopher D. Manning, and Jeffrey Heer (2012),
<a href="http://idl.cs.washington.edu/files/2012-Termite-AVI.pdf">“Termite: Visualization Techniques for Assessing Textual Topic Models”</a>
</li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Termite appears to be a Python system that imports Mallet data files to work
on.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: Dynamic visual analysis
tool designed specifically to augment the user’s ability to assess the quality
of topic models and topics.
<br>
<ul>
<li>The main visual design is a matrix view whose columns (labeled by number
on the X axis) are topics, and whose rows are words in those topics (shown
on the Y axis). Circles indicating the occurrence of terms in topics (at
intersection of X and Y axes) are sized to show the following kinds of
metrics:</li>
<li>Word frequency <em>(1st figure on the right)</em> – the bigger the circle,
the more frequent the word in the topic is.</li>
<li>Word “saliency” <em>(compared to frequency in the 2nd figure to the right)</em> – the bigger the circle, the more salient the word in the topic is. (“Saliency”
is calculated in a way that answers the question, to put it roughly, of “not
only how probable it is that the word occurs in the topic but also how
‘distinctive’ is the word in its relation to the topic” [precise mathematical
definition on p. 2 of the Chuang, Manning, & Heer article].)</li>
<li>The interface allows users to drill down to other views: “Users can drill
down to examine a specific topic by clicking on a circle or topic label
in the matrix. The visualization then reveals two additional views. The
word frequency view shows the topic’s word usage relative to the full corpus.
The document view shows the representative documents belonging to the topic.”</li>
<li>The interface allows for various ways to order topics and terms in the visualizations.</li>
<li>One of the most important is the ordering of terms in topics through a “seriation
algorithm”, which incorporates the collocation frequencies of words
with other words into its metrics <em>(compared to frequency in the 3rd figure to the right)</em>.
For example, the right matrix in the 3rd figure at right shows a seriated
view of a topic model in which Topic 25 displays a clear clustering of
collocated terms (the orange circles). Such seriated clusters assist in
identifying key concepts in topics (not just topic words, which may be
hard to understand in their distribution).</li>
<li>Visual analysis tool for assessing topic model quality. Termite uses a tabular
layout to promote comparison of terms both within and across latent topics.
It uses a novel saliency measure for selecting relevant terms and a seriation
algorithm that both reveals clustering structure and promotes the legibility
of related terms.</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/StanfordHCI/termite">GitHub repo</a> (Python scripts; Python version not specified)
<br>
<ul>
<li>Starting in 2014, Termite was split into two components (separate repos for
each component are linked from the main Termite repo; there do not appear
to have been any code updates for two years):</li>
<li>Termite Data Server “for processing the output of topic models and providing
the content as a web service”</li>
<li>Termite Visualization “for visualizing topic model outputs in a web browser”</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Scott</em>: I got this mostly working a few years ago, but for some reason
I don’t recall it matching my research needs at the time.</li>
<li><em>Alan</em>: “After spending a week working at implementing Termite, I’ve
concluded that it’s basically not possible on a Windows system. Termite
seems to be basically scripted for Linux or Mac all the way through. On
a windows system, I can’t compile, can’t run scripts, etc. In regard to
Termite: recall that in the past (before 2014), Termite was a simpler system
(and also constrained to topic modeling only a single file, rather than
folder of files, at a time). Now they have forked their code into a data
server (topic modeling creation and local server system), on the one hand,
and a visualization system, on the other hand. It’s the data server that
has me stuck.”</li>
<li><em>Scott: Update of March 21, 2016</em>: “I had a look at the Termite code
again last night–the old code, rather than the new split system. The one
file constraint is actually a file with each document on a separate line.
You could write a file from a folder like that (but probably not one with
30,000 documents). That’s probably how I did it in the past. It seems possible
to inject data at certain points in the pipeline, so you could run Mallet
separately and start Termite at the salience calculation stage. But it
would take a bit of hacking–sadly something I don’t have time to do now.
But it’s something to keep in mind if we find that the client-side lag
time in Serendip is untenable in the future. It may be that neither tool
is built for large data sets.”</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-6">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-1.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-1-th.jpg"
alt="Termite - word frequency per topic" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-saliency-comparison.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-saliency-comparison-th.jpg"
alt="Termite - comparison of term frequency vs. saliency rankings for topics"
title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-seriation-comparison.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-seriation-comparison-th.jpg"
alt="Termite - comparison of term frequency vs. seriation rankings for topics"
title="">
</a>
</p>
<hr>
<h2 id="tiara"><a>TIARA</a></h2>
<ul>
<li><strong>Description</strong>: <a href="http://users.cis.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p153.pdf">Wei, Furu, et al., “TIARA: A Visual Exploratory Text Analytic System”</a> (2010)</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>An all-in-one, start-to-finish system; does its own topic modeling. At present,
this seems to be a system designed for use in corporate settings to work
with corpora of well-structured documents (such as emails and medical information,
which are the examples in the article).</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visualization
system that is specialized to Lotus Notes and only usable (apparently) by IBM
corporate users.
<br>
<ul>
<li>TIARA does its own topic modeling, and in addition derives time-related data
(e.g., dates of emails in an email corpus) from the documents.</li>
<li>It uses the time-related data to create a timeline of topics in a stratified,
layered view in which each topic-layer is populated by key topic words
and varies in thickness based on weight in the corpus at the time. Clicking
on a topic-layer zooms in on it (widens the layer and shows more topic word
detail).</li>
<li>Topics can be reordered; the system also supports user merging and splitting
of topics.</li>
</ul>
</li>
<li><strong>Code site</strong>: [unknown; apparently not open to the public]</li>
<li><strong>Notes by WE1S team</strong>:</li>
</ul>
<h3 id="screen-shots-7">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tiara.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tiara-th.jpg"
alt="Tiara" title="">
</a>
</p>
<hr>
<h2 id="tom"><a>TOM</a></h2>
<ul>
<li><strong>Description</strong>: Adrien Guille and Edmundo-Pavel Soriano-Morales
(2016), <a href="http://mediamining.univ-lyon2.fr/people/guille/publications/egc2016_demo.pdf">“TOM: A Library for Topic Modeling and Browsing”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>A start-to-finish system. It takes input from a corpus (optionally supplemented
by metadata on dates, authors, etc.), then pre-processes it by lemmatizing
the text (English or French). It creates two kinds of topic models: LDA
and Non-negative Matrix Factorization (NMF). It also uses algorithms based
on state-of-the-art computer science research on topic models to help the
user optimize the number of topics (see the “Parameter Estimation” paragraph
on p. 2 of the Guille and Soriano-Morales article). A toy LDA/NMF comparison
is sketched after this list.</li>
</ul>
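<p><em>TOM itself is a Python library; the sketch below simply illustrates the two model families it wraps (LDA on raw counts, NMF on tf-idf) using scikit-learn equivalents rather than TOM’s own API:</em></p>
<pre class="prettyprint"><code class="language-python">from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the humanities teach critical thinking and writing",
    "funding for science and engineering keeps growing",
    "students weigh the value of a humanities degree",
    "research funding shapes what universities teach",
]

# LDA is fit on raw term counts ...
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(CountVectorizer().fit_transform(docs))

# ... while NMF is conventionally fit on tf-idf weights.
tfidf = TfidfVectorizer()
nmf = NMF(n_components=2, random_state=0)
nmf.fit(tfidf.fit_transform(docs))

# top words per NMF topic, analogous to TOM's topic descriptions
vocab = tfidf.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = [vocab[i] for i in row.argsort()[::-1][:3]]
    print("NMF topic", k, ":", ", ".join(top))
</code></pre>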
</li>
<li><strong>Notable interpretive features of interface</strong>: Dynamic visual exploration
interface with several views:
<br>
<ul>
<li>A <em>“topic cloud” view</em> (figure 1 to the right) shows each topic in
a bubble that is labeled by the most relevant words and whose diameter
indicates its weight in the overall corpus.</li>
<li>A <em>topic view</em>, a <em>text view</em>, (and also <em>author view</em>,
if there is metadata on authors), as in the lower figures to the right.
“For instance, the detailed view about a topic presents the most relevant
features, the evolution of the topic frequency through time, the list of
related texts and the collaboration network that links authors. The detailed
view for a text presents the most significant features, the topic distribution
and the most similar texts. Also, note that some elements may be missing,
depending on the meta-data available with the input corpus.”</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/AdrienGuille/TOM">Github repo</a> | <a href="https://github.com/AdrienGuille/TOM/blob/master/README.md">Readme.md</a> (TOM is a Python 2.7 library.)</li>
<li><strong>Notes by WE1S team</strong>:</li>
</ul>
<h2 id="screen-shots-8">Screen Shots</h2>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tom.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tom-th.jpg"
alt="TOM" title="">
</a>
</p>
<hr>
<h2 id="tome"><a>TOME</a></h2>
<ul>
<li><strong>Description</strong>:
<br>
<ul>
<li>NEH Digital Humanities Start-up Grant Proposal (2013): Lauren Klein, Principal
Investigator, <a href="http://www.neh.gov/files/grants/georgiatech_interactive_topic_and_metadata_visualization.pdf">“TOME: Interactive TOpic Model and MEtadata Visualization”</a>
</li>
<li>NEH Digital Humanities Start-up Grant White Paper (Final Report), <a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/TOMEwhitepaper.pdf">TOMEwhitepaper.pdf</a>
</li>
<li>See also the related publication: Eisenstein, J., I. Sun, and L. Klein, “<a href="http://dharchive.org/paper/DH2014/Paper-921.xml">Exploratory Text Analysis for Large Document Archives</a>.” <em>Digital Humanities 2014</em>.</li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Takes input from Mallet.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visualization
interface that has two views:
<br>
<ul>
<li><em>“Comparative research in thematic space” view</em> (figure on the top
right) is designed to allow a researcher to compare the topical compositions
of multiple different publications with serial issues (such as newspapers).
To do so, it focuses at any one time on a selected set of publications
in the corpus (e.g., different newspapers) and a selected set of topics
of interest (e.g., topics 9-13 in the figure).
<br> Each publication is represented as a worm-like trail composed of colored
circles.</li>
<li>Each circle corresponds to an issue of the publication;</li>
<li>the color of a circle corresponds to the editor of that run of issues from
the publication;</li>
<li>the size of a circle represents proportionally how much it is infused by
the selected set of topics the user is examining;</li>
<li>the position of the circles on the screen (and thus the topography of the
worm-like trail for a publication) is determined through a kind of push-pull
physics of interaction between the topics being examined. (In detail: <em>“The topics, here represented as squares, exert a ‘magnetic’ force on each point, pulling each point closer to the topics that are more prominent in that issue. For example, if a given newspaper issue contained words from T9 and T13 in equal measure, with no indication of T10, T11, or T12, then the corresponding circle would be positioned in between the squares representing T9 and T13”</em>.) A toy version of this layout is sketched after this list.</li>
<li><em>Multimodal research in temporal space view</em> (figure on bottom right)
is designed to allow a researcher to compare the trending of topics in
time (and is not focused on specific publications).</li>
<li>Typing a word into the search bar at the top of the view brings up a panel
list of topics (on the right) ranked by “relevancy” to the search-word
(“relevance—the frequency with which the query appears in each topic”).</li>
<li>The left panel shows the trends lines of the topics in time, with topics
color-coded to match the list of relevant topics. In the trend lines, the
thickness of a line indicates the relative weight of a topic at that time
in the corpus.</li>
<li>Note that “relevance” of a topic to a search-word and the weight of the topic
in the corpus have no necessary bearing on each other. A topic can be highly
relevant to a word (if the word is very frequent in the topic), for example,
but the topic as a whole can be less frequent in the corpus.</li>
</ul>
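<p><em>A toy version of the “magnetic” layout quoted above, reflecting our reading of the description rather than TOME’s code: each issue is placed at the weighted average of fixed topic positions, weighted by the topic proportions in that issue:</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np

# fixed positions of the topic "squares" (coordinates are arbitrary)
topic_xy = np.array([[0.0, 0.0],   # T9
                     [1.0, 0.0],   # T10
                     [2.0, 0.0],   # T11
                     [0.5, 1.0],   # T12
                     [1.5, 1.0]])  # T13

# an issue with words from T9 and T13 in equal measure, none of the rest
issue_weights = np.array([0.5, 0.0, 0.0, 0.0, 0.5])
issue_xy = (issue_weights @ topic_xy) / issue_weights.sum()
print(issue_xy)  # lands midway between T9 and T13, as in the example above
</code></pre>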
</li>
<li><strong>Code site</strong>: <a href="https://github.com/GeorgiaTechDHLab/TOME">Github repo</a> (Document file in the repo with instructions is <a href="https://github.com/GeorgiaTechDHLab/TOME/blob/master/README.docx?raw=true">here</a>.
The doc begins: “These files are simple HTML files that call the online version
of the D3.js library, as well as jQuery. Currently, a version of the D3.js
library has been downloaded and saved into the file so the entire project can
be pulled up on a local server. I ran this on my machine using the cmd line
and python version 2.7.”)
<br>
<ul>
<li>Usage and data format notes sent by Lauren Klein to Alan on 1/19/16:
<br>
<blockquote>
“We also generate the topics before formatting them for visualization. We used MALLET
at first, but then had to run a custom topic model since our corpus was
so large. Once you have the topics, the format is just a five-column
CSV, which you can see here: <a href="https://github.com/GeorgiaTechDHLab/TOME/blob/master/a_month_shorter3.csv">https://github.com/GeorgiaTechDHLab/TOME/blob/master/a_month_shorter3.csv</a>
<br> At the moment, the relevance is hand-calculated for one keyword, since
we haven’t built the keyword search function yet, but everything else
should be fairly self-explanatory.
<br> One of the things I’d like to do soon— (which may be a good option for
you) is if you’d like to avoid interacting with MALLET directly, hook
our interface into an instance of Bookworm, since one of the things that
Bookworm does well is offer an API (of sorts) for extracting all sorts
of info from a corpus, including topics. It requires a really huge amount
of disk space, since it tokenizes (or otherwise indexes) every single
word in the corpus beforehand. But in theory, if you can get everything
processed, (and I’ve also had some issues with the initial install, which
I haven’t had time to resolve), the actual hooking into whatever interface
you develop should be relatively easy.”</blockquote>
</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:</li>
</ul>
<h2 id="screen-shots-9">Screen Shots</h2> [![TOME “Trail of Dust” view](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-1-th.jpg)](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-1.jpg)[![TOME
“Multimodal” and Timeline View](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-2-th.jpg)](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-2.jpg)
<hr>
<h2 id="the-topic-browser"><a>The Topic Browser</a></h2>
<ul>
<li><strong>Description</strong>: Gardner, Matthew J, et al., <a href="http://cseweb.ucsd.edu/~lvdmaaten/workshops/nips2010/papers/gardner.pdf">“The Topic Browser: An Interactive Tool for Browsing Topic Models”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Topic Browser appears to input data from Mallet.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visualization
interface:
<br>
<ul>
<li>The main navigation and exploring tool is a ranked list of topics labeled
by the top two words in each topic <em>(seen on the top figure to the right)</em>.
(The interface also allows the user to navigate by documents, words, and
“attributes” [metadata].)</li>
<li>Because the Topic Browser “incorporates three other pieces of information:
attributes (metadata) associated with each document, topic metrics, and
document metrics,” it can use those metrics to rank and filter the topic
list in various ways to enhance interpretation – e.g., by “simple metrics,
such as the number of word tokens and types labeled with the topic, to
more complicated metrics such as how dispersed the topic is across documents,
or how coherent its words are.”</li>
<li>“When browsing through topics, the user can filter the topic list by coherence
to eliminate from the view those topics that are mostly meaningless and
sorted by document entropy (a measure of the dispersion of the topic across
the documents) to find topics that were used widely throughout the corpus.”</li>
<li>The interface can also show a concordance-like view of topic words in context,
topic word clouds, top 10 documents associated with each topic, top words