contours/u-series-segmentations


This repository contains topic segmentations of interview transcripts from the Oral Histories of the American South project.

The sentence-tokenized text of the transcripts can be found in the interview-sentences directory.

All the following files contain linear segmentations serialized as JSON in the Segeval format.
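As a rough illustration of reading these files, the sketch below parses a small inline example. It assumes the Segeval JSON layout is an "items" object mapping each item to its coders, and each coder to a list of segment masses; the item and coder names are made up for the example and do not come from this repository.

```python
import json

# Hypothetical example data; the item and coder names are invented.
EXAMPLE = """
{
  "segmentation_type": "linear",
  "items": {
    "interview-1": {
      "coder-a": [3, 2, 4],
      "coder-b": [5, 4]
    }
  }
}
"""


def iter_segmentations(dataset):
    """Yield (item, coder, masses) for every segmentation in the dataset."""
    for item, coders in dataset["items"].items():
        for coder, masses in coders.items():
            yield item, coder, masses


dataset = json.loads(EXAMPLE)
rows = list(iter_segmentations(dataset))
for item, coder, masses in rows:
    # Each list of masses sums to the length of the item.
    print(item, coder, masses, sum(masses))
```

To read one of the real files, replace `json.loads(EXAMPLE)` with `json.load` on an open file handle.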


segmentations.json

All human-created segmentations. Each segmentation is sentence-based, i.e. the segment masses are the numbers of sentences per segment.
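To make the notion of segment masses concrete, here is a minimal sketch that converts between masses and boundary positions. The function names are illustrative and are not part of this repository or of Segeval.

```python
from itertools import accumulate


def masses_to_boundaries(masses):
    """Return the sentence positions after which a boundary falls."""
    # Cumulative sums give the end position of every segment. The last one
    # is the end of the document, which is not an internal boundary.
    return list(accumulate(masses))[:-1]


def boundaries_to_masses(boundaries, total):
    """Inverse of masses_to_boundaries for a document of `total` sentences."""
    edges = [0] + list(boundaries) + [total]
    return [end - start for start, end in zip(edges, edges[1:])]


# Three segments of 3, 2 and 4 sentences: boundaries after sentences 3 and 5.
print(masses_to_boundaries([3, 2, 4]))
print(boundaries_to_masses([3, 5], 9))
```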


segmentations-by-speaker-turns.json

All human-created segmentations in speaker-turn-based form, i.e. the segment masses are the numbers of speaker turns per segment. This file was generated using the script project.py and the data files segmentations.json (see above) and speakers.json (see below).

project.py segmentations.json speakers.json 
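The following is one possible sketch of such a projection, not the actual logic of project.py. It assumes that a topic boundary falling in the middle of a speaker turn is moved back to the end of the previous turn, and that segments left empty by this are dropped. The function name is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def project_onto_turns(sentence_masses, turn_masses):
    """Re-express sentence-based masses as speaker-turn-based masses.

    sentence_masses: sentences per topic segment.
    turn_masses: sentences per speaker turn (as in speakers.json).
    """
    if sum(sentence_masses) != sum(turn_masses):
        raise ValueError("both segmentations must cover the same sentences")
    turn_ends = list(accumulate(turn_masses))
    topic_ends = list(accumulate(sentence_masses))
    # Assumption: count the turns that end at or before each topic boundary,
    # so a boundary inside a turn snaps back to the previous turn boundary.
    turn_counts = [bisect_right(turn_ends, end) for end in topic_ends]
    masses, previous = [], 0
    for count in turn_counts:
        if count > previous:  # skip segments that contain no complete turn
            masses.append(count - previous)
            previous = count
    return masses


# Five turns of 1, 2, 2, 1 and 3 sentences; topic segments of 3, 2 and 4.
print(project_onto_turns([3, 2, 4], [1, 2, 2, 1, 3]))
```

Other snapping rules (nearest turn boundary, or forward to the end of the current turn) are equally plausible; which one the script uses is not stated here.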

speakers.json

Segmentations with a boundary placed at each point where the speaker changes.


speechblocks.json

Segmentations with a boundary placed at each point where there is a speaker change or a paragraph break in the original interview transcript.
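Both speakers.json and speechblocks.json can be thought of as run-length encodings of per-sentence labels. The sketch below shows the idea under the assumption that each sentence carries a speaker label and a paragraph number; this input representation is hypothetical and is not how the repository stores its data.

```python
from itertools import groupby


def run_lengths(keys):
    """Return the lengths of consecutive runs of equal keys."""
    return [len(list(group)) for _, group in groupby(keys)]


# Hypothetical per-sentence annotations: (speaker, paragraph number).
sentences = [("A", 1), ("A", 1), ("A", 2), ("B", 3), ("A", 4), ("A", 4)]

# A boundary wherever the speaker changes (speakers.json style).
speaker_masses = run_lengths(speaker for speaker, _ in sentences)

# A boundary wherever the speaker or the paragraph changes
# (speechblocks.json style).
speechblock_masses = run_lengths(sentences)

print(speaker_masses)
print(speechblock_masses)
```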


segmentations-<coder name>.json
segmentations-<coder name>-by-speaker-turns.json

These are subdivisions of segmentations.json: each file contains only the segmentations from a single coder, and each has a speaker-turn-based variant.
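A subdivision like this could be produced along the following lines. The sketch assumes the Segeval layout of an "items" object mapping items to coders to masses; the function name is illustrative and is not a script in this repository.

```python
def split_by_coder(dataset):
    """Return {coder: dataset containing only that coder's segmentations}."""
    seg_type = dataset.get("segmentation_type", "linear")
    per_coder = {}
    for item, coders in dataset["items"].items():
        for coder, masses in coders.items():
            subset = per_coder.setdefault(
                coder, {"segmentation_type": seg_type, "items": {}}
            )
            subset["items"][item] = {coder: masses}
    return per_coder


# Hypothetical data; the item and coder names are invented.
dataset = {
    "segmentation_type": "linear",
    "items": {
        "interview-1": {"coder-a": [3, 2, 4], "coder-b": [5, 4]},
        "interview-2": {"coder-a": [6, 6]},
    },
}
subsets = split_by_coder(dataset)
print(sorted(subsets))
```

Each value in `subsets` could then be written to its own file with `json.dump`.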


segmentations-null.json
segmentations-null-by-speaker-turns.json

A null (no boundaries) segmentation generated using nullseg.py and segmentations.json, and its speaker-turn-based version.

nullseg.py segmentations.json
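A null segmentation has a single segment whose mass is the whole length of the item. The sketch below illustrates this; it is not the code of nullseg.py. It assumes the Segeval "items" layout and uses "null" as a hypothetical coder name.

```python
def null_segmentation(dataset, coder="null"):
    """Build a dataset with one boundary-free segmentation per item."""
    items = {}
    for item, coders in dataset["items"].items():
        # All coders segment the same sentences, so any coder's masses
        # sum to the length of the item.
        total = sum(next(iter(coders.values())))
        items[item] = {coder: [total]}
    return {
        "segmentation_type": dataset.get("segmentation_type", "linear"),
        "items": items,
    }


# Hypothetical data; the item and coder names are invented.
example = {"items": {"interview-1": {"coder-a": [3, 2, 4], "coder-b": [5, 4]}}}
print(null_segmentation(example)["items"])
```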

segmentations-random.json
segmentations-random-by-speaker-turns.json

A random segmentation generated using randomseg.py and segmentations.json, and its speaker-turn-based version.

randomseg.py segmentations.json
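How randomseg.py chooses its boundaries is not described here. One simple way to build a random baseline, shown below purely as an assumption, is to place a fixed number of boundaries uniformly at random among the possible positions. The function name and parameters are hypothetical.

```python
import random


def random_segmentation(total, n_boundaries, rng):
    """Return masses for `n_boundaries` boundaries placed uniformly at random.

    total: number of sentences in the item.
    rng: a random.Random instance, so that results can be reproduced.
    """
    if not 0 <= n_boundaries <= total - 1:
        raise ValueError("need 0 <= n_boundaries <= total - 1")
    # Possible boundary positions lie after sentences 1 .. total - 1.
    cuts = sorted(rng.sample(range(1, total), n_boundaries))
    edges = [0] + cuts + [total]
    return [end - start for start, end in zip(edges, edges[1:])]


masses = random_segmentation(9, 2, random.Random(0))
print(masses, sum(masses))
```

The number of boundaries could, for example, be taken from a human segmentation of the same item so that the baseline has a comparable number of segments.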

segmentations-bayes-settings.json

Segmentations produced by my fork of the Java code from the 2008 EMNLP paper "Bayesian Unsupervised Topic Segmentation" by Eisenstein and Barzilay.


texttiling-parameter-sweep

Directory with a large number of segmentations generated by running TextTiling with various parameter settings. The filenames reflect the values assigned to the following parameters:

  • w : pseudosentence (token sequence) size
  • k : block size
  • m : depth score cutoff is mean(depth) - m*stddev(depth)
  • n : number of rounds of smoothing
  • s : smoothing width
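The role of the m parameter can be illustrated with a short sketch that computes the depth score cutoff and selects the gaps whose depth exceeds it. The sketch uses the population standard deviation, which is an assumption, and it omits the smoothing controlled by n and s. The function names are illustrative.

```python
from statistics import mean, pstdev


def depth_cutoff(depths, m):
    """Cutoff as described above: mean(depth) - m * stddev(depth)."""
    # Assumption: population standard deviation (pstdev).
    return mean(depths) - m * pstdev(depths)


def candidate_boundaries(depths, m):
    """Indices of the gaps whose depth score exceeds the cutoff."""
    cutoff = depth_cutoff(depths, m)
    return [i for i, depth in enumerate(depths) if depth > cutoff]


# Made-up depth scores for four gaps between pseudosentences.
depths = [0.1, 0.5, 0.2, 0.9]
print(depth_cutoff(depths, 0))
print(candidate_boundaries(depths, 0))
```

A larger m lowers the cutoff, so more gaps qualify as boundaries and the resulting segmentation is finer.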

About

Topic segmentations of interviews in the DocSouth U-series
