GitHub - contours/u-series-segmentations: Topic segmentations of interviews in the DocSouth U-series

This repository contains topic segmentations of transcripts of interviews from the Oral Histories of the American South project.

The sentence-tokenized text of the transcripts can be found in the interview-sentences directory.

All the following files contain linear segmentations serialized as JSON in the Segeval format.

segmentations.json

All human-created segmentations. Each segmentation is sentence-based, i.e. the segment masses are the numbers of sentences per segment.
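To illustrate what segment masses encode, here is a minimal sketch that converts a list of masses into boundary positions. The nested layout (items, then interview id, then coder, then masses) is an assumption about the Segeval JSON format, and the interview id "U-0000", the coder "coder-a", and the numbers are made up for the example.

```python
import json
from itertools import accumulate

# Hypothetical data in the assumed Segeval layout:
# items -> interview id -> coder -> list of segment masses.
example = json.loads("""
{
  "segmentation_type": "linear",
  "items": {
    "U-0000": {"coder-a": [3, 5, 2]}
  }
}
""")


def masses_to_boundaries(masses):
    """Return the sentence offsets at which a new segment starts."""
    # Cumulative sums give the end offset of every segment; the last one
    # is the end of the transcript, not an internal boundary.
    return list(accumulate(masses))[:-1]


for interview, coders in example["items"].items():
    for coder, masses in coders.items():
        # Total sentence count, followed by internal boundary offsets.
        print(interview, coder, sum(masses), masses_to_boundaries(masses))
```

The masses of every segmentation of the same interview should sum to the same number of sentences, which makes the sum a cheap sanity check when loading a file.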

segmentations-by-speaker-turns.json

All human-created segmentations. Each segmentation is speaker-turn-based, i.e. the segment masses are the numbers of speaker turns per segment. This file was generated using the script project.py and the data files segmentations.json (see above) and speakers.json (see below).

project.py segmentations.json speakers.json
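One way such a projection could work is sketched below. This is not the code of project.py: the function name and the rule that a topic boundary falling inside a speaker turn is moved forward to the end of that turn are both assumptions made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def project_onto_turns(segment_masses, turn_masses):
    """Re-express sentence-based segment masses as speaker-turn counts.

    Assumption: a boundary inside a speaker turn is moved forward to the
    end of that turn. The real project.py may use a different rule.
    """
    if sum(segment_masses) != sum(turn_masses):
        raise ValueError("segmentations cover different numbers of sentences")
    # Sentence offset at which each speaker turn ends.
    turn_ends = list(accumulate(turn_masses))
    # For each segment end, count the turns completed once the turn
    # containing that sentence offset has finished.
    turn_counts = [bisect_left(turn_ends, end) + 1
                   for end in accumulate(segment_masses)]
    # Boundaries that collapsed into the same turn are merged.
    unique = sorted(set(turn_counts))
    return [b - a for a, b in zip([0] + unique, unique)]


# Hypothetical data: four turns of 2, 3, 1 and 4 sentences.
print(project_onto_turns([3, 7], [2, 3, 1, 4]))
```

Under this rule the projected masses always sum to the number of speaker turns, so projections of different coders stay comparable.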

speakers.json

Segmentations with a boundary placed at each point where the speaker changes.
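As a minimal sketch of this idea, assuming one speaker label per sentence is available (the labels below are hypothetical), the masses of a speaker-change segmentation are the lengths of consecutive runs of the same speaker:

```python
from itertools import groupby


def speaker_turn_masses(speakers):
    """Given one speaker label per sentence, return sentences per turn."""
    # groupby yields one group per run of identical consecutive labels,
    # so each group corresponds to one speaker turn.
    return [len(list(run)) for _, run in groupby(speakers)]


# Hypothetical labels for a six-sentence excerpt.
labels = ["interviewer", "interviewee", "interviewee",
          "interviewee", "interviewer", "interviewee"]
print(speaker_turn_masses(labels))  # [1, 3, 1, 1]
```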

speechblocks.json

Segmentations with a boundary placed at each point where there is a speaker change or a paragraph break in the original interview transcript.

segmentations-coder name.json
segmentations-coder name-by-speaker-turns.json

These are subsets of segmentations.json, each containing only the segmentations from a single coder, along with their speaker-turn-based variants. In the filenames, "coder name" stands for the name of the coder.

segmentations-null.json
segmentations-null-by-speaker-turns.json

A null (no boundaries) segmentation generated using nullseg.py and segmentations.json, and its speaker-turn-based version.

nullseg.py segmentations.json
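A null segmentation has a single segment whose mass is the whole transcript. The sketch below shows the idea; it is not the code of nullseg.py, and the function names, the coder label "null", and the sample data are assumptions.

```python
def null_segmentation(masses):
    """Collapse any segmentation into one segment with no boundaries."""
    return [sum(masses)]


def null_dataset(items):
    """Build one null segmentation per interview.

    `items` maps interview id -> coder -> masses (assumed layout). Any
    coder's masses will do, since all of them cover the same sentences.
    """
    out = {}
    for interview, coders in items.items():
        any_masses = next(iter(coders.values()))
        out[interview] = {"null": null_segmentation(any_masses)}
    return out


# Hypothetical input with two coders for one interview.
items = {"U-0000": {"coder-a": [3, 5, 2], "coder-b": [4, 6]}}
print(null_dataset(items))  # {'U-0000': {'null': [10]}}
```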

segmentations-random.json
segmentations-random-by-speaker-turns.json

A random segmentation generated using randomseg.py and segmentations.json, and its speaker-turn-based version.

randomseg.py segmentations.json
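How randomseg.py chooses boundaries is not described here. One plausible baseline, sketched below as an assumption, keeps the number of segments of a reference segmentation and places the boundaries uniformly at random; the function name and the seed parameter are hypothetical.

```python
import random


def random_segmentation(masses, seed=None):
    """Return random masses with the same total and segment count.

    Assumption: boundaries are drawn uniformly without replacement from
    the positions between sentences. Every input mass must be >= 1.
    """
    rng = random.Random(seed)
    total = sum(masses)
    n_boundaries = len(masses) - 1
    # Internal boundary positions lie strictly between 0 and total.
    positions = sorted(rng.sample(range(1, total), n_boundaries))
    edges = [0] + positions + [total]
    return [b - a for a, b in zip(edges, edges[1:])]


# Hypothetical reference segmentation of a ten-sentence transcript.
print(random_segmentation([3, 5, 2], seed=0))
```

Passing a fixed seed makes the baseline reproducible, which matters when the random segmentation is compared against the human ones more than once.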

segmentations-bayes-settings.json

Segmentations produced by my fork of the Java code from the 2008 EMNLP paper "Bayesian Unsupervised Topic Segmentation" by Eisenstein and Barzilay. In the filename, "settings" stands for the settings used for that run.

texttiling-parameter-sweep

Directory with a large number of segmentations generated by running TextTiling with various parameter settings. The filenames reflect the values assigned to the following parameters:

w : pseudosentence (token sequence) size
k : block size
m : depth score cutoff is mean(depth) - m*stddev(depth)
n : number of rounds of smoothing
s : smoothing width
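The role of m in the cutoff formula can be sketched as follows. The depth scores are made up, and two details are assumptions rather than facts about the implementation used for the sweep: the population standard deviation is used, and a gap becomes a boundary only when its depth score is strictly greater than the cutoff.

```python
from statistics import mean, pstdev


def depth_cutoff(depths, m):
    """Cutoff = mean(depth) - m * stddev(depth)."""
    # pstdev is the population standard deviation (an assumption).
    return mean(depths) - m * pstdev(depths)


def select_boundaries(depths, m):
    """Return indices of gaps whose depth score exceeds the cutoff."""
    cutoff = depth_cutoff(depths, m)
    return [i for i, d in enumerate(depths) if d > cutoff]


# Hypothetical depth scores for four candidate gaps.
depths = [0.0, 1.0, 0.0, 1.0]
print(depth_cutoff(depths, 1), select_boundaries(depths, 1))
```

A larger m lowers the cutoff, so more gaps qualify as boundaries and the resulting segments become shorter.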

