$ ./gradlew build
Compiles the custom SpeakerAnnotator.
$ ./gradlew installApp
This puts all the necessary JARs in build/install/corenlp/lib/.
$ ./pipeline.sh
This script will run the pipeline on all the files listed in
filelist.txt. Annotated files will be written to data/out.
The following options can be configured in config.props:

threads: number of cores to use
filelist: list of input files
customAnnotatorClass.speaker: name of the class implementing the speaker annotator
annotators: comma-separated list of annotators to include in the pipeline
ssplit.eolonly: expect input that is already sentence-tokenized
outputDirectory: directory to put annotated documents in
replaceExtension: name output files like [document id].xml
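Putting those options together, a config.props might look like the
following sketch. The values, the annotator list, and the speaker
class name are illustrative assumptions, not defaults from this
project:

```properties
# illustrative config.props -- values and class name are assumptions
threads = 4
filelist = filelist.txt
customAnnotatorClass.speaker = SpeakerAnnotator
annotators = tokenize, ssplit, pos, lemma, ner, parse, speaker, dcoref
ssplit.eolonly = true
outputDirectory = data/out
replaceExtension = true
```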
Input files must be placed in the directory data/in/ and added to
filelist.txt. Your input files must be plain text files containing
one sentence per line, wrapped in a single doc XML element with an
id attribute identifying the document.
For example:
<doc id="U-0012">
In response to your first direction and then you can remind me of the others.
Okay.
[ many sentences snipped ... ]
And your help.
To you.
</doc>
splitta can be used to sentence-tokenize input files:
$ cat U0080_Transcript.txt | ./bin/splitta > U-0080.txt
The add-doc-ids.sh script can handle wrapping plain text files in doc
elements:
./add-doc-ids.sh data/in/*.txt
The command above will create an XML file alongside each TXT file.
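If the script is not available, the wrapping step can be approximated
in a few lines of shell. This is a minimal sketch, assuming the
document id is simply the file name without its extension (the real
add-doc-ids.sh may derive the id differently):

```shell
# wrap each plain text file in data/in/ in a <doc id="..."> element
# (assumes the document id is the base file name; add-doc-ids.sh may differ)
for f in data/in/*.txt; do
  [ -e "$f" ] || continue                # skip if the glob matched nothing
  id=$(basename "$f" .txt)
  { printf '<doc id="%s">\n' "$id"
    cat "$f"
    printf '</doc>\n'
  } > "${f%.txt}.xml"                    # U-0012.txt -> U-0012.xml
done
```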
For the SpeakerAnnotator to work, there must be an additional input
file for each document, named [document id].speakers. So, for
example, for the document with id U-0012 shown above, the
SpeakerAnnotator will expect to find a file data/in/U-0012.speakers.
This file is a plain text file with one speaker name per line. It
should have exactly the same number of lines as there are sentences
in the input document.
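Since a mismatched line count is an easy mistake to make, a quick
sanity check can help. The .xml extension and file layout here are
assumptions based on the examples above:

```shell
# check that the speakers file has one line per sentence
# (sentence count excludes the <doc ...> and </doc> wrapper lines)
doc=data/in/U-0012                       # illustrative document id
if [ -f "$doc.xml" ] && [ -f "$doc.speakers" ]; then
  sentences=$(grep -Ecv '^</?doc' "$doc.xml")
  speakers=$(wc -l < "$doc.speakers")
  if [ "$sentences" -eq "$speakers" ]; then
    echo "ok: $speakers speakers for $sentences sentences"
  else
    echo "mismatch: $sentences sentences, $speakers speakers" >&2
  fi
fi
```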
A speakers file can be created for one of the DocSouth interviews in
the Redis database by using the dump-speakers.py script in the
evaluation project:
$ ./dump-speakers.py U-0012 > U-0012.speakers
A speakers file can be created for a sentence-tokenized SOHP
transcript by using the print-speakers.py script in this project:
$ ./print-speakers.py U-0080.txt > U-0080.speakers
Output files will appear in the data/out directory. To see a
human-readable view of the coreference resolution:
./print-coreferences.py data/out/U-0012.xml