Fork of the Penn Treebank tokenizer
Original tokenizer written by Robert MacIntyre, University of Pennsylvania, late 1995
Original available at: http://www.cis.upenn.edu/~treebank/tokenizer.sed
Updated to:
- fix handling of commas inside numbers (e.g. 1,000)
- fix open/close quote handling
- generalize tokenization to documents with directional quotes
- handle additional contractions
- add an untokenizer that restores arbitrary tokenized documents to their original form
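
A few of the changes listed above can be illustrated with a small Python sketch. It is not the fork's actual implementation: the function name `sketch_tokenize` and the regular expressions are assumptions made for illustration, and only three behaviours are covered (commas inside numbers, directional quotes, and the "n't" contraction).

```python
import re

def sketch_tokenize(text):
    """Illustrative only; a hypothetical helper, not the fork's real rules."""
    # Split off a comma unless it sits between two digits (as in 1,000).
    text = re.sub(r"(?<!\d),|,(?!\d)", " , ", text)
    # Treat directional double quotes as separate tokens.
    text = re.sub(r"([\u201c\u201d])", r" \1 ", text)
    # Split the "n't" contraction Treebank-style: didn't -> did n't.
    # Accepts either a straight or a directional apostrophe.
    text = re.sub(r"(\w)(n['\u2019]t)\b", r"\1 \2", text)
    return text.split()

tokens = sketch_tokenize("She said, \u201cI didn't pay 1,000 dollars\u201d")
print(tokens)
```

In this sketch the comma after "said" becomes its own token while "1,000" stays intact, and the opening and closing quotes are kept as distinct characters, so a matching untokenizer could tell them apart.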