
Statistical word tokenizer (sentence to words) #27

aliok opened this issue Dec 10, 2012 · 0 comments

aliok commented Dec 10, 2012

The rule-based part is already available: https://github.com/aliok/trnltk/blob/master/trnltk/tokenizer/texttokenizer.py

It doesn't work well with:

  1. Abbreviations like M.Ö. or ing.
  2. Ordinals like 3.
  3. Roman numerals like III and III.
  4. Parentheses such as "(abc"
  5. Some phrases which are multiple words but should be considered as one: "hafta sonu" => "hafta_sonu"
  6. Proper nouns which are multiple words but should be considered as one: "İç Anadolu" => "İç_Anadolu"
  7. Duplications

Ideas (while tokenizing):

  1. Check whether M.Ö. is used as an abbreviation (see the sketch after this list)
  2. This is rule-based, I think. A sentence almost never ends with a cardinal number.
  3. Needs morphological support for that first.
  4. Seems rule-based.
  5. After tokenization, check whether such a phrase occurs; if so, the words could be merged (see the merging sketch below).
  6. Same as 5.
  7. Issue #32 (Duplication recognition in tokenization) is related.
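
A rough sketch of what 1 and 2 could look like. `KNOWN_ABBREVIATIONS`, `period_belongs_to_token`, the example abbreviation entries, and the lowercase-next-token hint are illustrative assumptions, not existing trnltk code:

```python
# Hypothetical abbreviation inventory; a real one would be much larger.
KNOWN_ABBREVIATIONS = {"M.Ö.", "M.S.", "ing.", "Dr.", "Prof."}

def period_belongs_to_token(token, next_token=None):
    """Guess whether the trailing period of `token` is part of the token
    (abbreviation or ordinal) rather than a sentence-final full stop."""
    if token in KNOWN_ABBREVIATIONS:
        return True
    # Idea 2: a sentence almost never ends with a cardinal number,
    # so "3." is far more likely an ordinal than "3" plus a full stop.
    if token.endswith(".") and token[:-1].isdigit():
        return True
    # Extra hint (an assumption, not from the list above): if the next
    # token starts lowercase, the period is probably not sentence-final.
    if next_token is not None and next_token[:1].islower():
        return True
    return False

# period_belongs_to_token("3.", "sırada")  -> True (ordinal)
# period_belongs_to_token("M.Ö.")          -> True (abbreviation)
```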

For 5 and 6, see the TDK page on compound words that are written separately ("Ayrı Yazılan Birleşik Kelimeler"): http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=221:Ayri-Yazilan-Birlesik-Kelimeler&catid=50:yazm-kurallar&Itemid=132
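
A rough sketch of the post-tokenization merge for 5 and 6, assuming a phrase inventory built from something like the TDK list above; `MULTIWORD_PHRASES` and `merge_multiword_tokens` are made-up names:

```python
# Illustrative two-word inventory; longer phrases would need a longer
# lookahead window (or a trie) with the same idea.
MULTIWORD_PHRASES = {
    ("hafta", "sonu"): "hafta_sonu",
    ("İç", "Anadolu"): "İç_Anadolu",
}

def merge_multiword_tokens(tokens):
    """Replace known multiword phrases with a single merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD_PHRASES:
            merged.append(MULTIWORD_PHRASES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# merge_multiword_tokens(["her", "hafta", "sonu", "gelir"])
# -> ["her", "hafta_sonu", "gelir"]
```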
