docsouth-import

This is a tool to import XML transcripts (in a particular subset of the TEI.2 specification) from the DocSouth dataset into the Redis database format used by the contours/segment annotation tool.

The main script is docsouth-to-redis.py. See the comments at the top of that file for usage information. You will likely also need to change SVM_LEARN and SVM_CLASSIFY in sbd.py to set the location of your SVM-Light executables.
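As a rough illustration of the SVM-Light setting mentioned above, the sketch below shows what the two assignments in sbd.py might look like, plus a small check that the paths are usable. The install paths are hypothetical placeholders, and missing_executables is a helper invented for this example; it is not part of sbd.py.

```python
import os

# Hypothetical install locations; substitute the paths on your machine.
SVM_LEARN = "/usr/local/bin/svm_learn"
SVM_CLASSIFY = "/usr/local/bin/svm_classify"


def missing_executables(paths):
    """Return the names whose path is not an existing, executable file."""
    return [
        name
        for name, path in paths.items()
        if not (os.path.isfile(path) and os.access(path, os.X_OK))
    ]


if __name__ == "__main__":
    missing = missing_executables(
        {"SVM_LEARN": SVM_LEARN, "SVM_CLASSIFY": SVM_CLASSIFY}
    )
    if missing:
        print("Not found or not executable:", ", ".join(missing))
    else:
        print("Both SVM-Light executables look usable.")
```

Running a check like this before the import can save a failed run partway through sentence boundary detection.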

sbd.py, sbd_util.py, and word_tokenize.py come from Splitta, the sentence boundary detection tool by Dan Gillick. The model_svm directory contains the SVM-based sentence boundary model from the same tool.

The docsouth directory contains the gzipped XML data from DocSouth, with minor modifications. A handful of the files had issues, such as a missing speaker definition or speechblock tags that lacked the speaker ID reference. These errors were unambiguous and were corrected manually.
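To get a feel for one of the gzipped transcripts without unpacking it, a minimal sketch along these lines may help. It simply counts how often each element name occurs. count_tags is a name made up for this example, and it assumes the file parses with Python's standard ElementTree parser, which may not hold if a transcript relies on entities declared in an external DTD.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count element names in a gzipped XML file (path or binary file object)."""
    with gzip.open(source, "rb") as fh:
        tree = ET.parse(fh)
    return Counter(element.tag for element in tree.iter())


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real transcript; the tag names are made up.
    sample = io.BytesIO(gzip.compress(b"<doc><item/><item/></doc>"))
    for tag, count in count_tags(sample).most_common():
        print(tag, count)
```

Pass the path of a file in the docsouth directory instead of the in-memory sample to inspect real data.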

docsouth.dump.xz is the xz-compressed output of a complete run of this import tool. It consists of plain text (UTF-8) Redis database commands.
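Since the dump is xz-compressed UTF-8 text, it can be inspected from Python without decompressing it to disk first. The sketch below is one way to do that with the standard lzma module. iter_dump_lines is a hypothetical helper, and it assumes the dump is line-oriented, which the description above does not guarantee.

```python
import io
import lzma


def iter_dump_lines(source):
    """Yield non-empty lines of an xz-compressed UTF-8 text file.

    `source` may be a file path or a binary file object.
    """
    with lzma.open(source, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\r\n")
            if line:
                yield line


if __name__ == "__main__":
    # In-memory stand-in for docsouth.dump.xz; the command shown is made up.
    sample = io.BytesIO(lzma.compress("SET example:key value\n".encode("utf-8")))
    for line in iter_dump_lines(sample):
        print(line)
```

Replacing the in-memory sample with the path to docsouth.dump.xz would stream the real dump one line at a time.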
