Skip to content

Import DocSouth XML-formatted transcripts into a Redis database in the Contours format

Notifications You must be signed in to change notification settings

contours/docsouth-import

Repository files navigation

docsouth-import

This is a tool to import XML transcripts (in a particular subset of the TEI.2 specification) from the DocSouth dataset into the Redis database format used by the contours/segment annotation tool.

The main script is docsouth-to-redis.py. See the comments at the top of that file for usage information. You will likely also need to change SVM_LEARN and SVM_CLASSIFY in sbd.py to set the location of your SVM-Light executables.

sbd.py, sbd_util.py, and word_tokenize.py are the Splitta sentence boundary detection tool by Dan Gillick. The model_svm contains the SVM-based sentence boundary model from the same tool.

The docsouth directory contains the gzipped XML data from DocSouth, with minor modifications. A handful of the files had issues such as one of the speaker definitions being missing, or some of the speechblock tags missing the speaker ID reference. These errors were unambiguous and were corrected manually.

docsouth.dump.xz is the xz-compressed output of a complete run of this import tool. It consists of plain text (UTF-8) Redis database commands.

About

Import DocSouth XML-formatted transcripts into a Redis database in the Contours format

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages