vansky/extended_penn_tokenizer

extended_penn_tokenizer

Fork of the Penn Treebank tokenizer

Original tokenizer written by Robert MacIntyre, University of Pennsylvania, late 1995
Original available at: http://www.cis.upenn.edu/~treebank/tokenizer.sed

Updated to:

  • fix 'comma in number' handling
  • fix open/close quote handling
  • generalize tokenization to documents with directional quotes
  • handle additional contractions
  • add an untokenizer that restores arbitrary tokenized documents to their original form
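
As a rough illustration of the kinds of rules listed above, here is a minimal Python sketch of Penn Treebank-style tokenization. It is not the repository's code (the original tokenizer is a sed script), and the function name `tokenize` and the exact regular expressions are assumptions chosen for this example. It shows three of the behaviours: commas inside numbers stay attached, straight and directional quotes both map to Penn-style `` and '', and common contractions are split.

```python
import re

def tokenize(text: str) -> str:
    """Illustrative Penn Treebank-style tokenizer (hypothetical, not the repo's rules)."""
    # Directional (curly) quotes map directly to Penn-style open/close quotes.
    text = text.replace("\u201c", " `` ").replace("\u201d", " '' ")
    # Treat a curly apostrophe like a straight one so the contraction rules apply.
    text = text.replace("\u2019", "'")
    # Straight double quote: opening if at the start or after whitespace/bracket,
    # closing otherwise.
    text = re.sub(r'(^|[\s(\[{])"', r"\1 `` ", text)
    text = text.replace('"', " '' ")
    # Split off a comma unless it sits between two digits (e.g. 1,000).
    text = re.sub(r"(?<!\d),|,(?!\d)", " , ", text)
    # Split off other punctuation and symbols.
    text = re.sub(r"([;:@#$%&?!])", r" \1 ", text)
    # Split off a sentence-final period only (simplifying assumption).
    text = re.sub(r"\.\s*$", " .", text)
    # Contractions: n't first, then 's, 'm, 'd, 're, 've, 'll.
    text = re.sub(r"n't\b", " n't", text, flags=re.IGNORECASE)
    text = re.sub(r"'(s|m|d|re|ve|ll)\b", r" '\1", text, flags=re.IGNORECASE)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(tokenize('"I can\'t pay $1,000, sorry," she said.'))
# `` I ca n't pay $ 1,000 , sorry , '' she said .
print(tokenize("\u201cIt\u2019s fine,\u201d he said."))
# `` It 's fine , '' he said .
```

The sketch only handles a period at the very end of the input and does not attempt untokenization; the actual tokenizer covers more cases than shown here.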
