<html>
<head></head>
<body>
<h1 id="topic-modeling-systems-and-interfaces">Topic Modeling Systems and Interfaces</h1>
<p>The 4Humanities “WhatEvery1Says” project conducted a comparative analysis in 2016
of the following topic modeling systems/interfaces. As a result, it chose to
implement Andrew Goldstone’s DFR-browser for its own work.</p>
<p>Report first published on November 26, 2017.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ul>
<li><a href="#convisit">ConVisit</a>
</li>
<li><a href="#dfr-browser">DFR-Browser</a>
</li>
<li><a href="#inpho">InPHO Topic Explorer</a>
</li>
<li><a href="#networked-corpus">The Networked Corpus</a>
</li>
<li><a href="#pyldavis">pyLDAvis</a>
</li>
<li><a href="#serendip">Serendip</a>
</li>
<li><a href="#termite">Termite</a>
</li>
<li><a href="#tiara">TIARA</a>
</li>
<li><a href="#tom">TOM</a>
</li>
<li><a href="#tome">TOME</a>
</li>
<li><a href="#topic-browser">The Topic Browser</a>
</li>
<li><a href="#topical-guide">Topical Guide</a>
</li>
<li><a href="#topicnets">TopicNets</a>
</li>
<li><a href="#twic">TWIC</a>
</li>
</ul>
<p>(The following are the materials that the WE1S team researched in advance of
its February 18, 2016, meeting focused on choosing and implementing a system/platform/interface
for the exploration and interpretation of topic models.)</p>
<hr>
<h2 id="convisit"><a>ConVisIT</a></h2>
<ul>
<li><strong>Description</strong>: E. Hoque and Giuseppe Carenini (2015), <a href="http://www.cs.ubc.ca/~carenini/TEACHING/CPSC503-16/READINGS/iui0167-paper-SUBMITTED.pdf">“ConVisIT: Interactive Topic Modeling for Exploring Asynchronous Online Conversations”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>An all-in-one, start-to-finish system that does its own topic modeling of
a corpus.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ol>
<li>Interactive visualization interface designed for topic modeling of asynchronous
conversations on the Internet (email, blog comments, etc.).</li>
<li>Interface shows the overall conversation (left panel in figure)</li>
<li>Interface also shows the actual conversation (right panel in figure)</li>
<li>“Human-in-the-loop” feature to allow humans iteratively to assess the results
of a topic model and tweak it interactively in a sense-making activity–e.g.,
change granularity of topics, merge or split topics, suppress a topic, or
specify that words must (or must not) be in a topic.</li>
<li>Has an algorithm for automatic labeling of topics. (p. 4 of PDF)</li>
<li>Feedback on the interface was assessed through a user study.</li>
</ol>
</li>
<li><strong>Code site</strong>: [unknown]</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Alan</em>: Can it be adapted for articles?</li>
</ul>
</li>
</ul>
<h3 id="screen-shots">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/ConVisIT.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/ConVisIT-th.jpg"
alt="ConVisIT" title="">
</a>
</p>
<hr>
<h2 id="dfr-browser"><a>DFR-Browser</a></h2>
<ul>
<li><strong>Description</strong>: Andrew Goldstone, <a href="http://agoldst.github.io/dfr-browser/">“Dfr-Browser: Take a MALLET to Disciplinary History”</a> (2013)</li>
<li><strong>Demos</strong>: <a href="http://agoldst.github.io/dfr-browser/demo/">Topics in</a> <em><a href="http://agoldst.github.io/dfr-browser/demo/">PMLA</a></em> |
<a
href="http://signsat40.signsjournal.org/topic-model/">Topics in</a> <em><a href="http://signsat40.signsjournal.org/topic-model/">Signs</a></em> | <a href="http://jgoodwin.net/htb/">Hathi Trust Fiction 1920-22</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Out of the box, Goldstone’s DFR-Browser is specialized to take input from
Jstor’s <a href="http://about.jstor.org/service/data-for-research">DFR (Data for Research)</a> service, run it through Mallet using Goldstone’s companion R package,
<a
href="http://github.com/agoldst/dfrtopics">dfrtopics</a>, and then use <em>d3</em> to generate a dynamic visual exploration
interface. This start-to-finish workflow is modularized, however, allowing
for the use of alternative methods for generating the topic models and
formatted data files that DFR-Browser expects:</li>
<li>Instead of using Goldstone’s R package to generate the Mallet topic model
and then create the specially formatted data files for the DFR-Browser,
a user can run Mallet and output the formatted data files entirely on
the command line. (See <a href="https://github.com/agoldst/dfr-browser#preparing-data-files-entirely-on-the-command-line">instructions here</a> in the Github repo; a sketch of such a pipeline appears below.)</li>
<li>There is also a section in the Github repo titled <a href="https://github.com/agoldst/dfr-browser#browser-data-file-specifications">“Browser data file specifications”</a> that gives detailed instructions about the format and nature of the data
files that DFR-Browser expects. (In principle, this should allow topic
model files that were pre-generated in other ways to be converted into
data files for DFR-Browser.)</li>
</ul>
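<p><em>A minimal sketch (ours, not Goldstone’s) of the kind of command-line pipeline described above, scripted in Python via <code>subprocess</code>. The Mallet commands are standard; the final <code>prepare-data</code> step is shown schematically, and the repo’s “Preparing data files entirely on the command line” instructions should be consulted for the exact invocation:</em></p>
<pre class="prettyprint"><code class="language-python">import subprocess

# 1. Import a directory of plain-text articles into Mallet's binary format.
subprocess.check_call([
    "mallet", "import-dir",
    "--input", "corpus_txt/",
    "--output", "corpus.mallet",
    "--keep-sequence",
    "--remove-stopwords",
])

# 2. Train a topic model, keeping the Gibbs sampling state from which
#    DFR-Browser's data files are derived.
subprocess.check_call([
    "mallet", "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "50",
    "--optimize-interval", "10",
    "--output-state", "topic-state.gz",
    "--output-doc-topics", "doc-topics.txt",
])

# 3. Convert the sampling state into the files the browser expects
#    (e.g., tw.json and dt.json.zip). This call is schematic; see the
#    repo instructions for the exact arguments.
subprocess.check_call([
    "python", "bin/prepare-data", "convert-state", "topic-state.gz",
    "--tw", "data/tw.json", "--dt", "data/dt.json.zip",
])
</code></pre>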
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visual
exploration interface with several main views:
<br>
<ul>
<li><em>“Overview”</em> (top figure at right) showing topics as circles laid
out in a regular grid, each circle labeled with the six most important
words in a topic. Clicking on a topic brings the user to →</li>
<li><em>“Topic” view</em> (bottom figure at right) showing a ranked list of words
in the topic with bars representing relative weight (left panel), a timeline
graph of the topic’s weight in the corpus, and a list of articles ranked
by the amount of the topic infused in them.</li>
<li>Clicking on a word in the ranked list of topic words brings the user to →
<em>“Word” view</em>, which shows what other topics the word appears in
(and its relative weight in that topic).</li>
<li>Clicking on a document brings the user to → <em>“Document” view</em>, which
shows a ranked list of other topics in that document (and their relative
weights in the document)</li>
<li>Clicking on a bar in the timeline graph of a topic brings the user to → a
view showing the top documents in that year infused by that topic.</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/agoldst/dfr-browser">GitHub repo</a>.
The following sections of the documentation on Github indicate that we might
be able to generate the topic-modeling and other data files for the DFR-Browser
with the WE1S material:
<br>
<ul>
<li><a href="https://github.com/agoldst/dfr-browser#preparing-data-files-entirely-on-the-command-line">“Preparing data files entirely on the command line”</a>
</li>
<li><a href="https://github.com/agoldst/dfr-browser#adapting-this-project-to-other-kinds-of-documents">“Adapting this project to other kinds of documents”</a>
</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Alan</em>: we should try using the command line as instructed in the
documentation to see if we can create the data files in the format needed
for DFR-Browser</li>
<li><em>Lindsay</em>: I kind of got this to work. With Goldstone’s new instructions
and the prepare-data python script, I was able to figure out how to get
the data into the format the browser needs in the command line. I was also
able to do this using the dfrtopics package in R with a pre-run mallet
file (which I did in the command line). However, I haven’t yet figured
out how to properly configure the browser’s main js file so that it will
all display properly (right now only the overview view works as it should).
Also, I tried this on a very small subset of our corpus (10 articles from
the NYT) because getting the metadata into the right shape so that dfrtopics/the
prepare-data python script can read it properly is still something I don’t
know how to do. I just did it manually, and it worked, but we would have
to create a script that could wrangle metadata for us in order to do this
for a larger number of documents.</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-1">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-1.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-1-th.jpg"
alt="DFR-Browser, multiple topics view" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-2.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/dfr-browser-2-th.jpg"
alt="DFR-Browser, single topic view" title="">
</a>
</p>
<hr>
<h2 id="inpho-topic-explorer"><a>InPhO Topic Explorer</a></h2>
<ul>
<li><strong>Description (Demos)</strong>: <a href="http://inphodata.cogs.indiana.edu/">home page</a>
</li>
<li>See also Jaimie Murdock and Colin Allen (2015), <a href="http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/10007/9852">“Visualization Techniques for Topic Model Checking”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>An all-in-one, start-to-finish system; does its own topic modeling of a corpus.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ul>
<li>Interactive visual exploration interface that shows:</li>
<li>List of all documents in a corpus (file names listed vertically down the
left, with the first line of the document showing to help the user grok
the article). Documents are ranked according to the weight of the topic
that is a user’s “focus” at present. (E.g., if a user is examining Topic
1, then the document where Topic 1 is most prevalent will be at the top.)</li>
<li>Bands of color superimposed over every document filename and first line that
are color-coded to the topics in the topic model (where the legend for
topic colors is at the right)</li>
<li>The relative size of each color band among all the colors in a document indicates
the weight of specific topics in the article.</li>
<li>The cumulative width of the color bands for a document indicates the similarity
of the document to the user’s current “focus” topic (or document).</li>
<li>Clicking on a color anywhere resets the “focus” to the topic corresponding
to that color, with the whole list of articles and color bands shifting
to reorient around that topic (ranked with the articles most expressive
of that topic at the top).</li>
<li>When a topic is selected, clicking the “Top Documents for [Topic]” button
at lower right of the interface “will take you to a new page showing the
most similar documents to that topic’s word distribution.”</li>
<li>There is a search function to identify which documents in a corpus contain
a word.</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/inpho/topic-explorer">GitHub repo</a> (an Anaconda 2.7 Python distribution)</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Alan</em>: This is a package for Python 2.7. I tried to install, but
installation of the underlying VSM module failed. One issue that appeared
in the error messages: “error: Microsoft Visual C++ 9.0 is required (Unable
to find vcvarsall.bat). Get it from <a href="http://aka.ms/vcpython27">http://aka.ms/vcpython27</a>”
I’ve seen this error before when a package installation on a Windows machine
calls on Visual C++ as its compiler, but the particular machine does not
have Visual C++.</li>
<li><em>Scott</em>: The above issue has supposedly been addressed (as of May 25, 2016),
so it might be worth pulling the repo again. The tool has some interesting
features that might help in defining stopword lists.</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-2">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/inpho.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/inpho-th.jpg"
alt="InPhO" title="">
</a>
</p>
<hr>
<h2 id="the-networked-corpus"><a>The Networked Corpus</a></h2>
<ul>
<li><strong>Description</strong>: Jeff Binder and Collin Jennings, <a href="http://www.networkedcorpus.com/">The Networked Corpus</a>
<br>
<ul>
<li>See also article: Jeffrey Binder and Collin Jennings (2014), <a href="http://llc.oxfordjournals.org/content/29/3/405.full">“Visibility and Meaning in Topic Models and 18th-Century Indexes”</a></li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>: Takes input from Mallet.</li>
<li>
<p><strong>Notable interpretive features of interface</strong>:
<br>
</p>
<ul>
<li>Interactive visualization interface that takes input from Mallet. The key
design principle of the interface is to avoid using topic labels (which
can be deceptive) but instead to provide an easy way to identify passages
in documents that are “dense” with a particular topic, allow them to be
compared to other passages also dense with the topic, and thus provide
the user with an understanding of a topic’s meaning built up from intertextual
context.
<br> In particular, the interface:</li>
<li>Shows a document in the left panel and a list of topics (by number)
in the right panel</li>
<li>Choosing a topic number highlights in the document the words that belong
to that topic</li>
<li>A line graphs the topic “density” of passages in the document, with peaks
indicated by asterisks. Clicking on an asterisk calls up a list of links
to other passages (including in other documents) that are dense with that
topic.
<br>
</li>
</ul>
<p></p>
<blockquote>
<p>(The density functions in the interface are calculated using Mallet’s topic-state
file as follows: <em>“the density function is computed using kernel density estimation, which takes into account the words in nearby lines. Using these density functions, the program picks out ‘exemplary passages’ for each topic based on a simple rubric. Passages are only selected if the topic matches at least a certain number of words in the text (default 25), and they are only added if the topic’s maximum density in the text is at least (by default) four times as high as the average density of the whole document. If both of these conditions are met, an asterisk is created at the point of greatest density, with links to every other asterisk that was created for that topic.”</em>)</p>
</blockquote>
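<p><em>As a rough illustration of the rubric quoted above (not the tool’s actual code), the following Python sketch computes a Gaussian kernel density over the line positions of one topic’s words and applies the two thresholds:</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np

def exemplary_peak(positions, n_lines, bandwidth=5.0,
                   min_matches=25, density_ratio=4.0):
    # positions: line numbers at which one topic's words occur in a document
    if len(positions) &lt; min_matches:
        return None  # topic must match at least `min_matches` words
    lines = np.arange(n_lines)
    # Gaussian kernel density estimate over line numbers, so that words
    # on nearby lines contribute to each line's topic "density".
    diffs = lines[:, None] - np.asarray(positions)[None, :]
    density = np.exp(-0.5 * (diffs / bandwidth) ** 2).sum(axis=1)
    # the peak must be several times the document-wide average density
    if density.max() &lt; density_ratio * density.mean():
        return None
    return int(density.argmax())  # the line where an asterisk would go
</code></pre>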
</li>
<li><em>Note</em>: one interesting theoretical tenet of The Networked Corpus is that
topic modeling produces an apparatus for understanding texts and moving around
them non-linearly in a way analogous to earlier “indexing” (and other such apparatus)
in the history of writing and print.</li>
<li><strong>Code site</strong>: <a href="https://github.com/jeffbinder/networkedcorpus">GitHub repo</a>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Scott</em>: <strong><a href="https://github.com/scottkleinman/WE1S/tree/master/networkedcorpus">Instructions for implementing The Networked Corpus</a></strong>;
includes an adapted version of the code files as a <a href="https://raw.githubusercontent.com/scottkleinman/WE1S/master/networkedcorpus/networkedcorpus.zip">zip file</a>;
currently the instructions are for implementation on a Windows machine.</li>
<li><em>Alan’s</em> <strong><a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/page/106519659/Alan%27s%20Instructions%20for%20Implementing%20The%20Networked%20Corpus">step-by-step version of Scott’s instructions</a></strong>,
including a temporary kludge solution for non-ASCII character problems (arrived
at after debugging correspondence with Scott below).</li>
<li><em>Alan</em>: This set of instructions gets The Networked Corpus to run. However,
the results are not as expected and do not match the screenshots seen at
the right. At first, my thought was that the <em>browser.css</em> and <em>index.css</em> files generated by the <em>gen-networked-corpus.py</em> script, together
with the HTML in the html versions of each original text file also generated
by that script, need to be tweaked for today’s browsers. However, after investigation
and experiments, that seems not to be the case.
<br> Instead, the problem seems to lie in a mismatch between our input text format
and that expected by Networked Corpus. Here are the relevant instructions
in the Networked Corpus Github site: <em>“The text files must have hard line breaks at the end of each line. This is used to calculate how far down the page a word occurs, and also affects how wide the text will appear in the browser. If your source documents do not have line breaks and you are on a Mac or Linux system, you can use the ‘fold’ command to wrap them automatically. It doesn’t matter whether the line breaks are Unix or DOS-style. Finally, the first line of each file should be a title; this will be used in the table of contents and in a few other places.”</em>
<br> The plain-text article files in the WE1S corpus have no line breaks. Less
importantly, they are not formatted in a way that makes the first line the
title. (Instead, Networked Corpus ends up treating the entire text as a title,
placing it in the title element in the head of each of the HTML file versions
it generates for an article.)
<br> When using Mallet to create a topic model, the format of the original plain-text
file and the presence or absence of line breaks is irrelevant. However, the
way Networked Corpus seems to work is that it creates an HTML version of each
original plain text file, which when opened in a browser is correlated via
JavaScript to the Mallet data about that file on a token-by-token
basis. The format of the original plain-text files and the presence or absence
of line breaks has an impact on these HTML files and the way they are displayed
in a browser. In particular, Networked Corpus creates in the HTML page for
an article a table of the text in which topic words for a chosen topic are
highlighted. Each “line” of a text is supposed to be a single row, so that
the table extends down the page row-by-row. But if the original plain-text
file has no line breaks, then there is just a single row extending off the
right of the page, nullifying the whole point of Networked Corpus’s document
view of the topic model.
<br> It seems that the next step is to try the “fold” command referred to in the
instructions from Networked Corpus’s Github site above on the WE1S article
files and see what we get. (A Python equivalent of “fold” is sketched below.)</li>
</ul>
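<p><em>A scripted equivalent of the “fold” step, as a sketch under our assumptions about file paths: it hard-wraps each article and prepends the file name as a title line, as the Networked Corpus instructions require:</em></p>
<pre class="prettyprint"><code class="language-python">import textwrap
from pathlib import Path

src, dst = Path("corpus_txt"), Path("corpus_wrapped")  # illustrative paths
dst.mkdir(exist_ok=True)
for f in src.glob("*.txt"):
    text = f.read_text(encoding="utf-8")
    wrapped = textwrap.fill(text, width=80)  # hard line breaks, like `fold`
    # the first line must be a title; here we fall back on the file name
    (dst / f.name).write_text(f.stem + "\n" + wrapped + "\n", encoding="utf-8")
</code></pre>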
</li>
<li>Debugging issues to date leading to above implementation solution:</li>
<li><em>Alan’s error report</em> in response to Scott’s initial instructions for implementing
The Networked Corpus. Implementation produced the following error:</li>
<pre class="prettyprint"><code class="language-python hljs "> ___init__.py<span class="hljs-string">", line 586, in <module>_
_ from ._ufuncs import *_
_ImportError: DLL load failed: The specified module could not be found._</span></code></pre>
<ul>
<li><em>Scott on the DLL load error</em>: “I think the answer might be <a href="http://stackoverflow.com/questions/31596125/python-dll-load-failed">here</a>.
Try running <code>conda update scipy</code> from the command line.”</li>
<li><em>Alan’s error report</em>: “This worked. I’m finally getting gen-networked-corpus.py
to run.
<br> However, I’m now getting a unicode error:</li>
</ul>
<pre class="prettyprint"><code class="language-python hljs "> _File <span class="hljs-string">"C:\Users\Alan\Anaconda\lib\codecs.py"</span>, line <span class="hljs-number">492</span>, <span class="hljs-keyword">in</span> read_
_ newchars, decodedbytes = self.decode(data, self.errors)_
_UnicodeDecodeError: <span class="hljs-string">'utf8'</span> codec can<span class="hljs-string">'t decode byte 0xac in position 0: invalid start byte_
I created the .mallet file for the Mallet topic model using the regex parameter you
suggested: _--token-regex "[\p{L}\p{M}]+"_</span></code></pre>
<pre><code>I'm guessing this is the kind of error that caused you to start debugging the unicode problems in the first place. Let me know if you have any suggestions."
</code></pre>
<ul>
<li><em>Scott’s response</em>: “The line causing the Unicode error is part of a loop
through a directory file list, so it seems to run into problems if the directory
contains something other than the text files you are using to generate your
topic model. This includes the Mallet output. When I set line 303 to a directory
containing only the text files (in this case, one of your early New York Times
collections), I didn’t get the error.</li>
</ul>
<p>Unfortunately, I got another error at the next stage, where the script was getting
hung up at the name “François”. Obviously, we can avoid this problem by stripping
diacritics, but we shouldn’t have to. When I get a chance, I’ll try to figure
it out. But go ahead and try the advice in the previous paragraph, and see if
it works for you.”</p>
<ul>
<li><em>Alan’s error report</em>: “Thanks, Scott. I see. I was misunderstanding what
the “datadir=” in line 303 is supposed to point to: the directory of original
text files and not the directory of Mallet output files for the topic model
of those text files.</li>
</ul>
<p>Unfortunately, after getting that right I am getting another Unicode error that
may be indicating an unexpected character in the plain text (just as you did):</p>
<pre class="prettyprint"><code class="language-python hljs "> _File <span class="hljs-string">"C:\Users\Alan\Anaconda\lib\encodings\cp437.py"</span>, line <span class="hljs-number">12</span>, <span class="hljs-keyword">in</span> encode_
_ <span class="hljs-keyword">return</span> codecs.charmap_encode(input,errors,encoding_map)_
_UnicodeEncodeError: <span class="hljs-string">'charmap'</span> codec can<span class="hljs-string">'t encode character u'</span>\u0301<span class="hljs-string">' in position 72: character maps to <undefined>_</span></code></pre>
<ul>
<li><em>Alan’s temporary kludge solution to the above error</em>: Use “search and
replace” in Notepad++ (set to regex) to delete all non-ASCII characters in
the article files being topic modeled for The Networked Corpus. (<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/page/106519659/Alan%27s%20Instructions%20for%20Implementing%20The%20Networked%20Corpus">See instructions</a>; a scripted version of this kludge is sketched after these notes.)</li>
<li>(<em>Scott’s earlier notes</em>: Some observations: The script must be run from
within the input directory, and the script expects the Mallet output files
to be named as shown in the sample command on GitHub. When I ran it, I encountered
Unicode errors, so I tried a model using <code>--token-regex '[\p{L}\p{M}]+'</code>,
as suggested on the GitHub repo. However, this caused the Mallet train-topics
command to fail. Apparently, in Windows the regular expression <em>must</em> be enclosed in double quotes. The Python script also seems to have substantial
problems with character encoding and/or Windows. I am hacking my way through
it, gradually getting closer to a full implementation, but at the moment I’m
stuck on a particularly confusing block of code. <strong>Update</strong>: I
have never actually managed to get <code>--token-regex</code> to work in Mallet, so the point about double quotes is important independent
of the Networked Corpus tool. As for the tool itself, I have finally managed
to get it to run all the way through. I had to hack the code and inject my
own paths to get it to pull data from the right folders. The result was a little
disappointing, as it produced buggy html/css/javascript (or some combination
of those). The following information is readable. Document Index, Topic Index,
top 10 topics in each document, top 10 documents in each topic. The script
is supposed to choose “exemplary passages” if the topic matches 25 words in
the text and the topic’s maximum density in the text is at least 4 times as
high as the average over the whole document. There did not appear to be any
“exemplary passages”, perhaps because I used Mallet’s tiny sample data set
to build my model. Supposedly, if both of these conditions are met, an asterisk
is created at the point of greatest density, with links to every other asterisk
that was created for that topic. From the images displayed on the website,
this appears to be a visualisation function using protovis.js. Either the javascript
failed or it wasn’t called simply because my data did not produce any exemplary
passages.)</li>
<li><em>Alan</em>: Just to add information that may, or may not, be relevant to Scott’s
original problem with Unicode issues as documented in his note: the WE1S scraping
workflow saves all plain-text files in UTF-8.</li>
</ul>
</ul>
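<p><em>A scripted version of the non-ASCII kludge described in the notes above, as a sketch assuming the WE1S files are UTF-8 (as noted); it transliterates accented characters to their base form (so “François” becomes “Francois”) rather than deleting them outright:</em></p>
<pre class="prettyprint"><code class="language-python">import unicodedata
from pathlib import Path

for f in Path("corpus_wrapped").glob("*.txt"):  # illustrative path
    text = f.read_text(encoding="utf-8")
    # decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    f.write_text(ascii_text, encoding="utf-8")
</code></pre>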
<h3 id="screen-shots-3">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/networked-corpus.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/networked-corpus-th.jpg"
alt="Networked Corpus" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus1.PNG"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus1.PNG"
alt="Networked Corpus" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus2.PNG"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus2.PNG"
alt="Networked Corpus" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/NetworkedCorpus2.PNG">Click for a readable image</a>
</p>
<hr>
<h2 id="pyldavis"><a>pyLDAvis</a></h2>
<ul>
<li><strong>Description</strong>: Ben Mabey and Paul English, <a href="https://github.com/bmabey/pyLDAvis">pyLDAvis</a>
</li>
<li>A Python port of the LDAvis R package.</li>
<li>For a concise explanation of the visualization see this <a href="http://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf">vignette</a> from the LDAvis R package.</li>
<li><strong>Code site</strong>: <a href="https://github.com/bmabey/pyLDAvis">GitHub repo</a></li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Takes input from multiple types of topic models (a minimal usage sketch follows this list).</li>
</ul>
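<p><em>A minimal usage sketch of pyLDAvis’s model-agnostic entry point, with toy arrays standing in for a real model’s output:</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np
import pyLDAvis

# pyLDAvis.prepare() takes raw distributions, so output from Mallet,
# gensim, scikit-learn, etc. can all be fed to it. Toy data below.
K, D, V = 3, 4, 6                                # topics, docs, vocab size
rng = np.random.default_rng(0)
topic_term = rng.dirichlet(np.ones(V), size=K)   # each row sums to 1
doc_topic = rng.dirichlet(np.ones(K), size=D)    # each row sums to 1
doc_lengths = [120, 80, 200, 150]
vocab = ["humanities", "science", "funding", "students", "crisis", "value"]
term_frequency = [40, 35, 25, 30, 10, 15]

vis = pyLDAvis.prepare(topic_term, doc_topic, doc_lengths, vocab, term_frequency)
pyLDAvis.save_html(vis, "ldavis.html")  # writes a self-contained interactive page
</code></pre>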
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ul>
<li>The GitHub site links to numerous examples and demos.</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Not yet reviewed.</em>
</li>
<li>Requires scikit-bio package, which is not yet supported for Windows. Windows
support is scheduled for July 2016. In the meantime, there may be a workaround
<a href="http://stackoverflow.com/questions/27029212/trouble-installing-scikit-bio-on-windows-xp">here</a>,
but it has not yet been tested.</li>
<li>Additionally, scikit-bio is now no longer compatible with Python 2 and thus
would require a separate Python 3 virtual environment (although that’s
relatively easy to do in an Anaconda installation). It may be worth looking
into using the <a href="https://github.com/cpsievert/LDAvis">R package</a>.</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-4">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/file/106758792/pyLDAvis1.png"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/pyLDAvis1.png"
alt="pyLDAvis" title="">
</a>
<br>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/w/file/106758792/pyLDAvis2.png"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/pyLDAvis2.png"
alt="pyLDAvis" title="">
</a>
</p>
<hr>
<h2 id="serendip"><a>Serendip</a></h2>
<p><em>Note: Parts of the descriptions and screenshots in the mini-report on Serendip here are excerpted from Scott Kleinman’s <a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/pdf/report-on-serendip.pdf">report-on-serendip.pdf</a>. Other descriptions and one screenshot are based on the Eric Alexander et al. article.</em>
</p>
<ul>
<li><strong>Description</strong>: Eric Alexander, et al. (2014), <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwiti_zIh7TKAhUBTGMKHTi_At0QFggiMAE&url=https%3A%2F%2Fgraphics.cs.wisc.edu%2FPapers%2F2014%2FAKVWG14%2FPreprint.pdf&usg=AFQjCNG-VY5ModzUaOQo8TrvVefKg50a5w&sig2=d-jzuGMxh9yFNjkrud5ghw">“Serendip: Topic Model-Driven Visual Exploration of Text Corpora”</a> (preprint)
<br>
<ul>
<li>See also the <a href="http://vep.cs.wisc.edu/serendip/">Project iPython notebook site</a>.</li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Serendip runs in a Python-Flask environment. It comes with a separate command-line
tool to call Mallet and generate topic models. The Mallet output data is
deposited in a Corpora folder and can then be accessed by the Serendip
interface. In addition to implementing Mallet, the command-line tool generates
multiple files used by the interface to navigate and manipulate the data.
Therefore, Serendip will not work if independently generated Mallet output
files are deposited in the Corpora folder. It is possible that the script
could be modified to read independently generated Mallet data, but this
would require some hacking of the Python script.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>:
<br>
<ul>
<li>Serendip is designed to give a view of a topic model at many scales, and
with much connection between views. There are three main views (see 1st
figure at right):</li>
<li><em>CorpusViewer</em>: “At the corpus level, we provide a reorderable matrix
to highlight adjacencies between documents and topics.” This view shows
“a re-orderable matrix that connects documents to topics. To address the
“many documents” and “many topics” issues of scale, the matrix supports
filtering and selection, aggregation, and ordering.” “We provide a query-system
that allows users to pick out documents and topics based on their metadata.
Once selected, these sets can be hand-tuned, colored, moved to a more prominent
position in the matrix (typically the top-left corner), used as a basis
for reordering the matrix … or saved to be explored later.”</li>
<li><em>TextViewer</em>: “At the document level, we use tagged text and overview
displays to help readers find and analyze passages in large documents.”
This view shows a tagged text visualization of “the topics and the text.
To support long documents, a summary graph shows how the topics occur over
the length of the document.”</li>
<li><em>RankViewer</em>: “Finally, at the level of individual words—a level we
only observed the need for after watching users interact with our text
level tool—we introduce a ranking visualization that shows how words are
distributed across the topic.” This view “allows users to examine specific
words and see which topics use them. This tool is useful for relating topics
and words, and comparing different topics and words. It can provide topics
(and orderings of topics) to explore more closely in other views.”</li>
<li>Serendip provides three metrics for ranking relationships between topics
and documents:</li>
<li>Frequency (the percentage of a given topic accounted for by each word) – biased
towards words appearing in many topics</li>
<li>Information Gain (the information words gain towards identifying a given
topic) – biased towards rare words that best distinguish topics</li>
<li>Saliency (frequency multiplied by information gain) – finds salient words
across an entire model, not just within a topic. Saliency is the default
ranking metric. (A toy computation of these three metrics is sketched after this list.)</li>
</ul>
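<p><em>A toy computation of the three metrics, paraphrased from the descriptions above rather than taken from Serendip’s code; information gain is treated here as the KL divergence of P(topic|word) from P(topic):</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np

# topic-term count matrix: rows = topics, columns = words
counts = np.array([[10.0, 2.0, 0.0, 8.0],
                   [1.0, 12.0, 5.0, 2.0],
                   [0.0, 3.0, 9.0, 6.0]])

p_w = counts.sum(axis=0) / counts.sum()      # P(word) across the model
p_t = counts.sum(axis=1) / counts.sum()      # P(topic) across the model
p_t_given_w = counts / counts.sum(axis=0)    # P(topic | word)

frequency = counts / counts.sum(axis=1, keepdims=True)  # P(word | topic)
# information gain of each word for identifying topics
info_gain = (p_t_given_w *
             np.log2(p_t_given_w / p_t[:, None] + 1e-12)).sum(axis=0)
saliency = p_w * info_gain  # "frequency multiplied by information gain"
</code></pre>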
</li>
<li><strong>Code site</strong>: <a href="http://vep.cs.wisc.edu/serendip/">Project iPython notebook site with download and use instructions</a>.
(Python 2.7)</li>
<li><em>Scott Kleinman’s</em> <a href="https://github.com/whatevery1says/dev_resources/raw/master/report-on-topic-modeling-interfaces/assets/report-on-serendip.pdf">report-on-serendip.pdf</a>
</li>
<li><em>Scott Kleinman’s</em> <a href="https://github.com/scottkleinman/WE1S/tree/master/serendip">Instructions for implementing Serendip</a>
</li>
</ul>
<h2 id="screen-shots-5">Screen Shots</h2>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-th.jpg"
alt="Serendip - three main views" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-3-aggregated-data.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-3-aggregated-data-th.jpg"
alt="Serendip - aggregated data" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-4-term-distribution-and-metadata-th.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-4-term-distribution-and-metadata-th.jpg"
alt="Serendip - term distibution & metadata views" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-5-topic-words-in-text.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-5-topic-words-in-text-th.jpg"
alt="Serendip - topic words in text view" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-6-rank-viewer.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/serendip-6-rank-viewer-th.jpg"
alt="Serendip - rank viewer" title="">
</a>
</p>
<hr>
<h2 id="termite"><a>Termite</a></h2>
<ul>
<li><strong>Description</strong>: <a href="http://vis.stanford.edu/papers/termite">home page</a>
<br>
<ul>
<li>See also: Jason Chuang, Christopher D. Manning, and Jeffrey Heer (2012),
<a href="http://idl.cs.washington.edu/files/2012-Termite-AVI.pdf">“Termite: Visualization Techniques for Assessing Textual Topic Models”</a>
</li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Termite appears to be a Python system that imports Mallet data files to work
on.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: Dynamic visual analysis
tool designed specifically to augment the user’s ability to assess the quality
of topic models and topics.
<br>
<ul>
<li>The main visual design is a matrix view whose columns (labeled by number
on the X axis) are topics, and whose rows are words in those topics (shown
on the Y axis). Circles indicating the occurrence of terms in topics (at
intersection of X and Y axes) are sized to show the following kinds of
metrics:</li>
<li>Word frequency <em>(1st figure on the right)</em> – the bigger the circle,
the more frequent the word in the topic is.</li>
<li>Word “saliency” <em>(compared to frequency in the 2nd figure to the right)</em> – the bigger the circle, the more salient the word in the topic is. (“Saliency”
is calculated in a way that answers the question, to put it roughly, of “not
only how probable it is that the word occurs in the topic but also how
‘distinctive’ is the word in its relation to the topic” [precise mathematical
definition on p. 2 of the Chuang, Manning, & Heer article].)</li>
<li>The interface allows users to drill down to other views: “Users can drill
down to examine a specific topic by clicking on a circle or topic label
in the matrix. The visualization then reveals two additional views. The
word frequency view shows the topic’s word usage relative to the full corpus.
The document view shows the representative documents belonging to the topic.”</li>
<li>The interface allows for various ways to order topics and terms in the visualizations.</li>
<li>One of the most important is the ordering of terms in topics through a “seriation
algorithm”, which incorporates the collocation frequencies of words
with other words into its metrics <em>(compared to frequency in the 3rd figure to the right)</em>.
For example, the right matrix in the 3rd figure at right shows a seriated
view of a topic model in which Topic 25 displays a clear clustering of
collocated terms (the orange circles). Such seriated clusters assist in
identifying key concepts in topics (not just topic words, which may be
hard to understand in their distribution).</li>
<li>Visual analysis tool for assessing topic model quality. Termite uses a tabular
layout to promote comparison of terms both within and across latent topics.
It uses a novel saliency measure for selecting relevant terms and a seriation
algorithm that both reveals clustering structure and promotes the legibility
of related terms.</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/StanfordHCI/termite">GitHub repo</a> (Python scripts; Python version not specified)
<br>
<ul>
<li>Starting in 2014, Termite was split into two components (separate repos for
each component are linked from the main Termite repo; there do not appear
to have been any code updates for two years):</li>
<li>Termite Data Server “for processing the output of topic models and providing
the content as a web service”</li>
<li>Termite Visualization “for visualizing topic model outputs in a web browser”</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:
<br>
<ul>
<li><em>Scott</em>: I got this mostly working a few years ago, but for some reason
I don’t recall it matching my research needs at the time.</li>
<li><em>Alan</em>: “After spending a week working at implementing Termite, I’ve
concluded that it’s basically not possible on a Windows system. Termite
seems to be basically scripted for Linux or Mac all the way through. On
a windows system, I can’t compile, can’t run scripts, etc. In regard to
Termite: recall that in the past (before 2014), Termite was a simpler system
(and also constrained to topic modeling only a single file, rather than
folder of files, at a time). Now they have forked their code into a data
server (topic modeling creation and local server system), on the one hand,
and a visualization system, on the other hand. It’s the data server that
has me stuck.”</li>
<li><em>Scott: Update of March 21, 2016</em>: “I had a look at the Termite code
again last night–the old code, rather than the new split system. The one
file constraint is actually a file with each document on a separate line.
You could write a file from a folder like that (but probably not one with
30,000 documents). That’s probably how I did it in the past. It seems possible
to inject data at certain points in the pipeline, so you could run Mallet
separately and start Termite at the salience calculation stage. But it
would take a bit of hacking–sadly something I don’t have time to do now.
But it’s something to keep in mind if we find that the client-side lag
time in Serendip is untenable in the future. It may be that neither tool
is built for large data sets.”</li>
</ul>
</li>
</ul>
<h3 id="screen-shots-6">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-1.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-1-th.jpg"
alt="Termite - word frequency per topic" title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-saliency-comparison.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-saliency-comparison-th.jpg"
alt="Termite - comparison of term frequency vs. saliency rankings for topics"
title="">
</a>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-seriation-comparison.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/termite-2-term-frequency-vs-seriation-comparison-th.jpg"
alt="Termite - comparison of term frequency vs. seriation rankings for topics"
title="">
</a>
</p>
<hr>
<h2 id="tiara"><a>TIARA</a></h2>
<ul>
<li><strong>Description</strong>: <a href="http://users.cis.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p153.pdf">Wei, Furu, et al., “TIARA: A Visual Exploratory Text Analytic System”</a> (2010)</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>An all-in-one, start-to-finish system; does its own topic modeling. At present,
this seems to be a system designed for use in corporate settings to work
with corpora of well-structured documents (such as emails and medical information,
which are the examples in the article).</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visualization
system that is specialized to Lotus Notes and only usable (apparently) by IBM
corporate users.
<br>
<ul>
<li>TIARA does its own topic modeling, and in addition derives time-related data
(e.g., dates of emails in an email corpus) from the documents.</li>
<li>It uses the time-related data to create a timeline of topics in a stratified,
layered view in which each topic-layer is populated by key topic words
and varies in thickness based on weight in the corpus at the time. Clicking
on a topic-layer zooms in on it (widens the layer and shows more topic word
detail).</li>
<li>Topics can be reordered; the system also supports user merging and splitting
of topics.</li>
</ul>
</li>
<li><strong>Code site</strong>: [unknown; apparently not open to the public]</li>
<li><strong>Notes by WE1S team</strong>:</li>
</ul>
<h3 id="screen-shots-7">Screen Shots</h3>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tiara.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tiara-th.jpg"
alt="Tiara" title="">
</a>
</p>
<hr>
<h2 id="tom"><a>TOM</a></h2>
<ul>
<li><strong>Description</strong>: Adrien Guille and Edmundo-Pavel Soriano-Morales
(2016), <a href="http://mediamining.univ-lyon2.fr/people/guille/publications/egc2016_demo.pdf">“TOM: A Library for Topic Modeling and Browsing”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>A start-to-finish system. It takes input from a corpus (optionally supplemented
by metadata on dates, authors, etc.), then pre-processes it by lemmatizing
the text (English or French). It creates two kinds of topic models: LDA
and Non-negative Matrix Factorization (NMF). It also uses algorithms based
on state-of-the-art computer science research on topic models to help the
user optimize the number of topics (see the “Parameter Estimation” paragraph
on p. 2 of the Guille and Soriano-Morales article). A toy LDA/NMF comparison
is sketched after this list.</li>
</ul>
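<p><em>TOM itself is a Python library; the sketch below simply illustrates the two model families it wraps (LDA on raw counts, NMF on tf-idf) using scikit-learn equivalents rather than TOM’s own API:</em></p>
<pre class="prettyprint"><code class="language-python">from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the humanities teach critical thinking and writing",
    "funding for science and engineering keeps growing",
    "students weigh the value of a humanities degree",
    "research funding shapes what universities teach",
]

# LDA is fit on raw term counts ...
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(CountVectorizer().fit_transform(docs))

# ... while NMF is conventionally fit on tf-idf weights.
tfidf = TfidfVectorizer()
nmf = NMF(n_components=2, random_state=0)
nmf.fit(tfidf.fit_transform(docs))

# top words per NMF topic, analogous to TOM's topic descriptions
vocab = tfidf.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    top = [vocab[i] for i in row.argsort()[::-1][:3]]
    print("NMF topic", k, ":", ", ".join(top))
</code></pre>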
</li>
<li><strong>Notable interpretive features of interface</strong>: Dynamic visual exploration
interface with several views:
<br>
<ul>
<li>A <em>“topic cloud” view</em> (figure 1 to the right) shows each topic in
a bubble that is labeled by the most relevant words and whose diameter
indicates its weight in the overall corpus.</li>
<li>A <em>topic view</em>, a <em>text view</em>, (and also <em>author view</em>,
if there is metadata on authors), as in the lower figures to the right.
“For instance, the detailed view about a topic presents the most relevant
features, the evolution of the topic frequency through time, the list of
related texts and the collaboration network that links authors. The detailed
view for a text presents the most significant features, the topic distribution
and the most similar texts. Also, note that some elements may be missing,
depending on the meta-data available with the input corpus.”</li>
</ul>
</li>
<li><strong>Code site</strong>: <a href="https://github.com/AdrienGuille/TOM">Github repo</a> | <a href="https://github.com/AdrienGuille/TOM/blob/master/README.md">Readme.md</a> (TOM is a Python 2.7 library.)</li>
<li><strong>Notes by WE1S team</strong>:</li>
</ul>
<h2 id="screen-shots-8">Screen Shots</h2>
<p>
<a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tom.jpg"><img src="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tom-th.jpg"
alt="TOM" title="">
</a>
</p>
<hr>
<h2 id="tome"><a>TOME</a></h2>
<ul>
<li><strong>Description</strong>:
<br>
<ul>
<li>NEH Digital Humanities Start-up Grant Proposal (2013): Lauren Klein, Principal
Investigator, <a href="http://www.neh.gov/files/grants/georgiatech_interactive_topic_and_metadata_visualization.pdf">“TOME: Interactive TOpic Model and MEtadata Visualization”</a>
</li>
<li>NEH Digital Humanities Start-up Grant White Paper (Final Report), <a href="https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/TOMEwhitepaper.pdf">TOMEwhitepaper.pdf</a>
</li>
<li>See also the related publication: Eisenstein, J., I. Sun, and L. Klein, “<a href="http://dharchive.org/paper/DH2014/Paper-921.xml">Exploratory Text Analysis for Large Document Archives</a>.” <em>Digital Humanities 2014</em>.</li>
</ul>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Takes input from Mallet.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visualization
interface that has two views:
<br>
<ul>
<li><em>“Comparative research in thematic space” view</em> (figure on the top
right) is designed to allow a researcher to compare the topical compositions
of multiple different publications with serial issues (such as newspapers).
To do so, it focuses at any one time on a selected set of publications
in the corpus (e.g., different newspapers) and a selected set of topics
of interest (e.g., topics 9-13 in the figure).
<br> Each publication is represented as a worm-like trail composed of colored
circles.</li>
<li>Each circle corresponds to an issue of the publication;</li>
<li>the color of a circle corresponds to the editor of that run of issues from
the publication;</li>
<li>the size of a circle represents proportionally how much it is infused by
the selected set of topics the user is examining;</li>
<li>the position of the circles on the screen (and thus the topography of the
worm-like trail for a publication) is determined through a kind of push-pull
physics of interaction between the topics being examined. (In detail: <em>“The topics, here represented as squares, exert a ‘magnetic’ force on each point, pulling each point closer to the topics that are more prominent in that issue. For example, if a given newspaper issue contained words from T9 and T13 in equal measure, with no indication of T10, T11, or T12, then the corresponding circle would be positioned in between the squares representing T9 and T13”</em>.) A toy version of this layout is sketched after this list.</li>
<li><em>Multimodal research in temporal space view</em> (figure on bottom right)
is designed to allow a researcher to compare the trending of topics in
time (and is not focused on specific publications).</li>
<li>Typing a word into the search bar at the top of the view brings up a panel
list of topics (on the right) ranked by “relevancy” to the search-word
(“relevance—the frequency with which the query appears in each topic”).</li>
<li>The left panel shows the trends lines of the topics in time, with topics
color-coded to match the list of relevant topics. In the trend lines, the
thickness of a line indicates the relative weight of a topic at that time
in the corpus.</li>
<li>Note that “relevance” of a topic to a search-word and the weight of the topic
in the corpus have no necessary bearing on each other. A topic can be highly
relevant to a word (if the word is very frequent in the topic), for example,
but the topic as a whole can be less frequent in the corpus.</li>
</ul>
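<p><em>A toy version of the “magnetic” layout quoted above, reflecting our reading of the description rather than TOME’s code: each issue is placed at the weighted average of fixed topic positions, weighted by the topic proportions in that issue:</em></p>
<pre class="prettyprint"><code class="language-python">import numpy as np

# fixed positions of the topic "squares" (coordinates are arbitrary)
topic_xy = np.array([[0.0, 0.0],   # T9
                     [1.0, 0.0],   # T10
                     [2.0, 0.0],   # T11
                     [0.5, 1.0],   # T12
                     [1.5, 1.0]])  # T13

# an issue with words from T9 and T13 in equal measure, none of the rest
issue_weights = np.array([0.5, 0.0, 0.0, 0.0, 0.5])
issue_xy = (issue_weights @ topic_xy) / issue_weights.sum()
print(issue_xy)  # lands midway between T9 and T13, as in the example above
</code></pre>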
</li>
<li><strong>Code site</strong>: <a href="https://github.com/GeorgiaTechDHLab/TOME">Github repo</a> (Document file in the repo with instructions is <a href="https://github.com/GeorgiaTechDHLab/TOME/blob/master/README.docx?raw=true">here</a>.
The doc begins: “These files are simple HTML files that call the online version
of the D3.js library, as well as jQuery. Currently, a version of the D3.js
library has been downloaded and saved into the file so the entire project can
be pulled up on a local server. I ran this on my machine using the cmd line
and python version 2.7.”)
<br>
<ul>
<li>Usage and data format notes sent by Lauren Klein to Alan on 1/19/16:
<br>
<blockquote>
“We also generate the topics before formatting them for visualization. We used MALLET
at first, but then had to run a custom topic model since our corpus was
so large. Once you have the topics, the format is just a five-column
CSV, which you can see here: <a href="https://github.com/GeorgiaTechDHLab/TOME/blob/master/a_month_shorter3.csv">https://github.com/GeorgiaTechDHLab/TOME/blob/master/a_month_shorter3.csv</a>
<br> At the moment, the relevance is hand-calculated for one keyword, since
we haven’t built the keyword search function yet, but everything else
should be fairly self-explanatory.
<br> One of the things I’d like to do soon— (which may be a good option for
you) is if you’d like to avoid interacting with MALLET directly, hook
our interface into an instance of Bookworm, since one of the things that
Bookworm does well is offer an API (of sorts) for extracting all sorts
of info from a corpus, including topics. It requires a really huge amount
of disk space, since it tokenizes (or otherwise indexes) every single
word in the corpus beforehand. But in theory, if you can get everything
processed, (and I’ve also had some issues with the initial install, which
I haven’t had time to resolve), the actual hooking into whatever interface
you develop should be relatively easy.”</blockquote>
</li>
</ul>
</li>
<li><strong>Notes by WE1S team</strong>:</li>
</ul>
<h2 id="screen-shots-9">Screen Shots</h2> [![TOME “Trail of Dust” view](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-1-th.jpg)](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-1.jpg)[![TOME
“Multimodal” and Timeline View](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-2-th.jpg)](https://raw.githubusercontent.com/whatevery1says/dev_resources/master/report-on-topic-modeling-interfaces/assets/tome-2.jpg)
<hr>
<h2 id="the-topic-browser"><a>The Topic Browser</a></h2>
<ul>
<li><strong>Description</strong>: Gardner, Matthew J, et al., <a href="http://cseweb.ucsd.edu/~lvdmaaten/workshops/nips2010/papers/gardner.pdf">“The Topic Browser: An Interactive Tool for Browsing Topic Models”</a>
</li>
<li><strong>Topic modeling workflow</strong>:
<br>
<ul>
<li>Topic Browser appears to input data from Mallet.</li>
</ul>
</li>
<li><strong>Notable interpretive features of interface</strong>: A dynamic visualization
interface:
<br>
<ul>
<li>The main navigation and exploring tool is a ranked list of topics labeled
by the top two words in each topic <em>(seen on the top figure to the right)</em>.
(The interface also allows the user to navigate by documents, words, and
“attributes” [metadata].)</li>
<li>Because the Topic Browser “incorporates three other pieces of information:
attributes (metadata) associated with each document, topic metrics, and
document metrics,” it can use those metrics to rank and filter the topic
list in various ways to enhance interpretation – e.g., by “simple metrics,
such as the number of word tokens and types labeled with the topic, to
more complicated metrics such as how dispersed the topic is across documents,
or how coherent its words are.”</li>
<li>“When browsing through topics, the user can filter the topic list by coherence
to eliminate from the view those topics that are mostly meaningless and
sorted by document entropy (a measure of the dispersion of the topic across
the documents) to find topics that were used widely throughout the corpus.”</li>
<li>The interface can also show a concordance-like view of topic words in context,
topic word clouds, top 10 documents associated with each topic, top words