Frame Generator

Tool for extracting topics, keywords and their co-occurrence patterns (forming so-called frames) from a Dutch corpus.

Background

The Frame Generator was created in collaboration with prof. dr. Joris van Eijnatten (UU) during his KB Digital Humanities Fellowship in 2016. Its aim is to meaningfully reduce a set of texts to word patterns that cut across the distributions generated by topic modelling, thus providing additional insight into the content of the data set.

Features

Generate topics with the Mallet or Gensim topic modelling library
Extract a single, ranked list of keywords based on either topics or tf-idf scores
Find co-occurrence patterns for the keywords in the texts from which they were originally extracted
Optionally lemmatize and pos-tag the input texts with the NLP suite Frog and restrict the keywords and collocates to specific part-of-speech tags
Optionally correct OCR errors and spelling variations with user-provided lists of regular expressions
Access and reuse the results of each processing stage as comma-separated values files
Use from the command line, as a Python library or as a web application
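
To make the tf-idf option for keyword ranking more concrete, here is a minimal sketch of one common tf-idf formulation. It is an illustration only and not necessarily the formula the Frame Generator uses; the function name `tfidf_keywords` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, kcount=10):
    """Rank words by their summed tf-idf score over all documents.

    docs is a list of token lists. tf is the relative frequency of a word
    in a document; idf is log(N / df), where df is the number of documents
    that contain the word.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    scores = Counter()
    for doc in docs:
        if not doc:
            continue  # avoid division by zero for empty documents
        for word, count in Counter(doc).items():
            tf = count / float(len(doc))
            scores[word] += tf * math.log(n_docs / float(df[word]))
    return scores.most_common(kcount)


# Hypothetical, already tokenized documents.
docs = [['god', 'zotheid', 'zotheid'], ['god', 'mensch']]
print(tfidf_keywords(docs, kcount=2))
```

A word that occurs in every document (here `god`) gets an idf of zero and therefore drops to the bottom of the ranking.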

Requirements

Python 2.7
Mallet 2.0.7

Installation

Clone or download the GitHub repository:

$ git clone https://github.com/jlonij/frame-generator

Install the required Python packages:

$ cd frame-generator
$ pip install -r requirements.txt

Optional: install Frog and the Frog wrapper (Frogger) locally

Because Frog causes heavy load spikes on our infrastructure, availability of the hosted service cannot be guaranteed. The endpoint to which frame-generator will try to connect is a demo server, which might be down. If you want to analyze a lot of data, change the endpoint here:

https://github.com/KBNLresearch/frame-generator/blob/master/frame-generator/documents.py#L35

Install Frog and its dependencies; see https://github.com/LanguageMachines/frog

Start Frog:

$ frog -S 4096

Install the Frog wrapper:

Place the frogger directory in your Apache2 www-root (/var/www/frogger/). It uses a .htaccess file to launch the application from Apache.

Test the wrapper without HTTP:

$ mv frogger /var/www/; cd /var/www/frogger; python frog.py

If all went well, this should output some test text. Now try the wrapper over HTTP:

$ curl -s -G --data-urlencode "text=Dit is een test" http://localhost/frogger/

This should also return some test text.
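
When calling the wrapper from Python instead of curl, the text has to be URL-encoded before it is put in the query string. The following is a minimal sketch, assuming the wrapper accepts the text in a `text` query parameter as in the curl example above. The function name `build_frogger_url` is hypothetical and not part of the Frame Generator.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7


def build_frogger_url(text, endpoint='http://localhost/frogger/'):
    """Return the wrapper URL with the text passed as an encoded query string."""
    return endpoint + '?' + urlencode({'text': text})


url = build_frogger_url('Dit is een test')
print(url)
```

The resulting URL can then be fetched with any HTTP client.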

Usage

Basic command line execution with the default values for all options:

$ cd frame-generator
$ ./generator.py

This will generate keywords and frames for the sample documents provided and print the results:

Keywords and frames generated:
(1) zotheid/N [0.153417875946]
prijzen/WW (1.81959197914), verkondigen/WW (0.778800783071), kwaden/ADJ (0.472366552741), gepast/ADJ (0.367879441171), staan/WW (0.28650479686), aanstonds/ADJ (0.28650479686)
(2) god/N [0.0681857686941]
goddelijk/ADJ (0.606530659713), stellen/WW (0.606530659713), vervroolijk/ADJ (0.472366552741), uitzonderen/WW (0.367879441171)
(3) mensch/N [0.0271176554179]
vervroolijk/ADJ (0.778800783071), toeschrijven/WW (0.778800783071), gewoonlijk/ADJ (0.606530659713), verjagen/WW (0.367879441171), goddelijk/ADJ (0.367879441171), spreken/WW (0.28650479686)
...

Command line interface

Input files are expected to be either utf-8 or iso-8859-1 encoded and have to be placed in the appropriate frame-generator/input subdirectories:

docs contains plain text files with .txt extension of the documents to be processed.
stop contains optional stop word lists to be applied when creating the vocabulary. The stop word lists should be plain text files with .txt extension in which each word occupies a single line.
regex contains optional lists of regular expressions to be replaced in the input documents. These lists should have a .tsv extension and consist of two tab-separated columns, the first containing the regular expression and the second its intended replacement.
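
As an illustration of the regex list format described above, the following sketch parses two tab-separated columns and applies each pattern in the order listed. It is not the Frame Generator's own code; the function names and the sample patterns are hypothetical.

```python
import io
import re


def load_replacements(lines):
    """Parse tab-separated lines into (compiled pattern, replacement) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip empty lines
        pattern, replacement = line.split('\t', 1)
        pairs.append((re.compile(pattern), replacement))
    return pairs


def apply_replacements(text, pairs):
    """Apply every replacement to the text, in the order listed."""
    for pattern, replacement in pairs:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical list content; a real list would be read from a .tsv file
# in the regex input subdirectory.
sample_tsv = io.StringIO(u'\\bmensch\tmens\n\\bzoo\\b\tzo\n')
pairs = load_replacements(sample_tsv)
print(apply_replacements(u'De menschen zijn zoo', pairs))
```

Because the patterns are applied sequentially, a later pattern sees the output of the earlier ones, so the order of the lines in the list matters.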

The Frame Generator command line interface accepts a number of options to control the process:

--gtype: the type of results to be generated. The user can choose between generating topics, keywords or frames. The default value is frames.
--dlen: the number of sentences the subdocuments used as units of analysis are to contain. Default value is 0, in which case the original, unsplit documents will be used.
--nopos: when this option is entered (no value required) the part-of-speech tagging functionality of the Frame Generator is turned off. This saves a lot of processing time and allows the generator to run offline.
--tcount: the number of topics to be generated. Default value is 10.
--tsize: the number of words to be contained in each topic. Default value is 10.
--mallet: full path to the Mallet executable; if not provided, Gensim's LDA implementation will be used to generate topics.
--kmodel: model to be used for scoring keywords, either lda or tf-idf. By default lda is used, meaning keywords are extracted on the basis of a topic model.
--kcount: the number of keywords to be generated. Default value is 10.
--ktags: the part-of-speech tags to be included in the keyword list separated by spaces, e.g. ADJ N WW.
--wdir: the side of the keyword, left or right, on which frame words are searched for. When omitted, both sides are taken into account.
--wsize: the maximum word distance of a frame word to the keyword. Default value is 5.
--fsize: the maximum number of frame words to be generated with each keyword. Default value is 10.
--ftags: the part-of-speech tags to be included in the frame word list separated by spaces, e.g. ADJ N WW.
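
The interplay of the --wdir, --wsize and --fsize options can be sketched as a simple windowed search around each occurrence of a keyword. The example below is an assumption-laden illustration: it ranks frame words by raw counts, whereas the scores printed by the Frame Generator are not plain counts and their computation is not described here. The function name `collocates` is hypothetical; its parameters merely mirror the option names.

```python
from collections import Counter


def collocates(tokens, keyword, wsize=5, wdir=None, fsize=10):
    """Count words within wsize positions of each occurrence of keyword.

    wdir may be 'left', 'right' or None (both sides).
    Returns at most fsize (word, count) pairs, most frequent first.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        left = tokens[max(0, i - wsize):i] if wdir in (None, 'left') else []
        right = tokens[i + 1:i + 1 + wsize] if wdir in (None, 'right') else []
        # Do not count the keyword as its own frame word.
        counts.update(w for w in left + right if w != keyword)
    return counts.most_common(fsize)


# Hypothetical token sequence.
tokens = 'de zotheid prijzen en de zotheid verkondigen'.split()
print(collocates(tokens, 'zotheid', wsize=1, wdir='right'))
```

Restricting the search to one side or shrinking the window reduces the candidate set before the fsize cut-off is applied.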

Values accepted as part-of-speech tags with the --ktags and --ftags options are the following main tags from the CGN tag set:

ADJ Adjective
BW Adverb
N Noun
SPEC Names and unknown
TSW Interjection
TW Numeral
VNW Pronoun
WW Verb
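
Restricting a list to specific tags can be pictured as follows. This is a minimal sketch, assuming items in the word/TAG notation that appears in the sample output; the function name `filter_by_tags` is hypothetical and the code is not taken from the Frame Generator.

```python
# Main CGN tags listed above.
CGN_MAIN_TAGS = {'ADJ', 'BW', 'N', 'SPEC', 'TSW', 'TW', 'VNW', 'WW'}


def filter_by_tags(tagged_words, tags):
    """Keep only 'word/TAG' items whose tag is in tags."""
    unknown = set(tags) - CGN_MAIN_TAGS
    if unknown:
        raise ValueError('Unsupported tags: ' + ', '.join(sorted(unknown)))
    kept = []
    for item in tagged_words:
        # Split on the last slash so words containing '/' keep their tag.
        _, _, tag = item.rpartition('/')
        if tag in tags:
            kept.append(item)
    return kept


print(filter_by_tags(['zotheid/N', 'prijzen/WW', 'aanstonds/ADJ'], ['N', 'WW']))
```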

Application programming interface

The Frame Generator can be used from another Python script or the Python interpreter by calling the generate() function in the generator.py script:

>>> import generator
>>> _, keyword_list, frame_list = generator.generate()
>>> keyword_list.print_keywords()
Keywords generated:
(1) zotheid/N [0.169311243818]
(2) god/N [0.0725594158983]
(3) mensch/N [0.034531287966]
...

The arguments of the function correspond to the command line options listed above, the function signature being:

generate(gtype='frames', dlen=0, pos=True, tcount=10, tsize=10, mallet=None, kmodel='lda', kcount=10,
	ktags=[], wdir=None, wsize=5, fsize=10, ftags=[], input_dir='input', output_dir='output')

Web application

The Frame Generator can also be run as a simple Bottle web application accepting POST requests:

$ ./web.py

By default the service is started at http://localhost:8091/.

Demo

An online demo providing a graphical user interface to the Frame Generator’s main functionality and a basic visualization of the results is available at http://www.kbresearch.nl/frames/. The source code of the demo can be found here.
