This repository contains topic segmentations of transcripts of the following interviews from the Oral Histories of the American South project:
- U-0005
- U-0007
- U-0008
- U-0011
- U-0012
- U-0014
- U-0017
- U-0019
- U-0020
- U-0023
- U-0098
- U-0178
- U-0180
- U-0181
- U-0183
- U-0184
- U-0185
- U-0186
- U-0193
The sentence-tokenized text of the transcripts can be found in the interview-sentences directory.
All the following files contain linear segmentations serialized as JSON in the Segeval format.
segmentations.json
All human-created segmentations. Each segmentation is sentence-based, i.e. the segment masses are the number of sentences per segment.
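To illustrate what sentence-based masses encode, the following minimal sketch converts a list of segment masses into boundary positions. The function name and the example masses are hypothetical and are not taken from the data in this repository.

```python
from itertools import accumulate


def masses_to_boundaries(masses):
    """Return the sentence counts after which an internal boundary falls."""
    # The running total gives the end of each segment. The last total is
    # the end of the transcript, which is not an internal boundary.
    return list(accumulate(masses))[:-1]


# Hypothetical example: 9 sentences split into 3 segments.
example_masses = [3, 2, 4]
example_boundaries = masses_to_boundaries(example_masses)
print(example_boundaries)
```

A segmentation with a single mass has no internal boundaries, so the function returns an empty list for it.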
segmentations-by-speaker-turns.json
All human-created segmentations. Each segmentation is speaker-turn-based, i.e. the segment masses are the number of speaker turns per segment. This file was generated using the script project.py and the data files segmentations.json (see above) and speakers.json (see below).
project.py segmentations.json speakers.json
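One way such a projection could work is sketched below. This is not the code in project.py. It assumes that a boundary falling inside a speaker turn is moved forward to the end of that turn, and that boundaries landing in the same turn are merged. All names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def project_onto_turns(segment_masses, turn_masses):
    """Re-express sentence-based masses as speaker-turn-based masses."""
    if sum(segment_masses) != sum(turn_masses):
        raise ValueError("both segmentations must cover the same sentences")
    # Sentence count at which each speaker turn ends.
    turn_ends = list(accumulate(turn_masses))
    projected = []
    turns_consumed = 0
    for segment_end in accumulate(segment_masses):
        # Number of turns needed to reach (or pass) the end of this segment.
        turns_needed = bisect_left(turn_ends, segment_end) + 1
        # Skip boundaries that fall in a turn that has already been closed.
        if turns_needed > turns_consumed:
            projected.append(turns_needed - turns_consumed)
            turns_consumed = turns_needed
    return projected


# Hypothetical example: 9 sentences spoken in 4 turns.
print(project_onto_turns([3, 2, 4], [2, 1, 2, 4]))
```

Snapping forward is only one possible policy. Snapping to the nearest turn boundary would be an equally plausible choice.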
speakers.json
Segmentations with boundaries placed at each point where there is a speaker change.
Segmentations with boundaries placed at each point where there is a speaker change or a paragraph break in the original interview transcript.
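A speaker-change segmentation can be thought of as the run lengths of a per-sentence speaker sequence. The sketch below assumes a hypothetical input of one speaker label per sentence. How the speaker information is actually stored for these transcripts is not described here.

```python
from itertools import groupby


def speaker_turn_masses(speakers):
    """Return the number of sentences in each consecutive speaker turn."""
    # groupby collects adjacent equal labels, so each run is one turn.
    return [len(list(run)) for _, run in groupby(speakers)]


# Hypothetical labels for six sentences.
labels = ["A", "A", "B", "A", "A", "A"]
print(speaker_turn_masses(labels))
```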
segmentations-<coder name>.json
segmentations-<coder name>-by-speaker-turns.json
These are simply subdivisions of segmentations.json into files that each contain the segmentations from a single coder, along with their speaker-turn-based variations.
segmentations-null.json
segmentations-null-by-speaker-turns.json
A null (no boundaries) segmentation generated using nullseg.py and segmentations.json, and its speaker-turn-based version.
nullseg.py segmentations.json
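Conceptually, a null segmentation collapses all segments of an existing segmentation into one. The sketch below shows that idea on a plain list of masses. It is not the code in nullseg.py, and the function name is hypothetical.

```python
def null_segmentation(masses):
    """Return a single segment spanning every unit of the input."""
    # With no internal boundaries, the only mass is the total length.
    return [sum(masses)]


print(null_segmentation([3, 2, 4]))
```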
segmentations-random.json
segmentations-random-by-speaker-turns.json
A random segmentation generated using randomseg.py and segmentations.json, and its speaker-turn-based version.
randomseg.py segmentations.json
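How randomseg.py chooses its boundaries is not documented here. As one possible approach, the sketch below places a boundary after each unit independently with a fixed probability. The function name, the `boundary_probability` parameter, and its default value are assumptions made for illustration.

```python
import random


def random_segmentation(total_units, boundary_probability=0.1, rng=None):
    """Return random segment masses that sum to total_units."""
    if total_units < 1:
        raise ValueError("total_units must be at least 1")
    rng = rng or random.Random()
    masses = [1]  # the first unit always opens the first segment
    for _ in range(total_units - 1):
        if rng.random() < boundary_probability:
            masses.append(1)   # boundary: start a new segment
        else:
            masses[-1] += 1    # no boundary: extend the current segment
    return masses


# A seeded generator makes the result reproducible.
sample = random_segmentation(20, boundary_probability=0.2, rng=random.Random(0))
print(sample, sum(sample))
```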
segmentations-bayes-<settings>.json
Segmentations produced by my fork of the Java code from the 2008 EMNLP paper "Bayesian Unsupervised Topic Segmentation" by Eisenstein and Barzilay.
Directory with a large number of segmentations generated by running TextTiling with various parameter settings. The filenames reflect the values assigned to the following parameters:
- w: pseudosentence (token sequence) size
- k: block size
- m: depth score cutoff is mean(depth) - m * stddev(depth)
- n: number of rounds of smoothing
- s: smoothing width
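The cutoff formula for m can be illustrated with a short sketch. It assumes a list of depth scores has already been computed, one per candidate gap, and it uses the population standard deviation. Whether the TextTiling implementation used here takes the population or the sample standard deviation is not stated, so that choice is an assumption, as are the function names. The sketch also simplifies boundary selection to a plain threshold test.

```python
from statistics import mean, pstdev


def depth_cutoff(depths, m):
    """Compute mean(depth) - m * stddev(depth)."""
    # pstdev is the population standard deviation (an assumption).
    return mean(depths) - m * pstdev(depths)


def select_boundaries(depths, m):
    """Return the indices of gaps whose depth score exceeds the cutoff."""
    cutoff = depth_cutoff(depths, m)
    return [i for i, depth in enumerate(depths) if depth > cutoff]


# Hypothetical depth scores for five candidate gaps.
scores = [0.1, 0.9, 0.2, 0.8, 0.1]
print(select_boundaries(scores, m=0.5))
```

A larger m lowers the cutoff, so more gaps qualify as boundaries. A smaller or negative m makes the segmentation more conservative.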