
Statistical word tokenizer (sentence to words) #27

aliok opened this issue Dec 10, 2012 · 0 comments

aliok commented Dec 10, 2012

The rule-based part is already available: https://github.com/aliok/trnltk/blob/master/trnltk/tokenizer/texttokenizer.py

It doesn't work well with:

  1. Abbreviations like M.Ö. or ing.
  2. Ordinals like 3.
  3. Roman numerals like III and III.
  4. Parentheses such as "(abc"
  5. Some phrases which are multiple words but should be considered as one: "hafta sonu" => "hafta_sonu"
  6. Proper nouns which are multiple words but should be considered as one: "İç Anadolu" => "İç_Anadolu"
  7. Duplications

Ideas (while tokenizing):

  1. Check whether M.Ö. is used as an abbreviation (see the sketch after this list)
  2. This is rule-based, I think. A sentence almost never ends with a cardinal number.
  3. Needs morphological support for that first.
  4. Seems rule-based.
  5. After tokenization, check whether such a phrase occurs; if so, the words could be merged (see the merging sketch below).
  6. Same as 5.
  7. Issue #32 (Duplication recognition in tokenization) is related.
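
A rough sketch of what 1 and 2 could look like. `KNOWN_ABBREVIATIONS`, `period_belongs_to_token`, the example abbreviation entries, and the lowercase-next-token hint are illustrative assumptions, not existing trnltk code:

```python
# Hypothetical abbreviation inventory; a real one would be much larger.
KNOWN_ABBREVIATIONS = {"M.Ö.", "M.S.", "ing.", "Dr.", "Prof."}

def period_belongs_to_token(token, next_token=None):
    """Guess whether the trailing period of `token` is part of the token
    (abbreviation or ordinal) rather than a sentence-final full stop."""
    if token in KNOWN_ABBREVIATIONS:
        return True
    # Idea 2: a sentence almost never ends with a cardinal number,
    # so "3." is far more likely an ordinal than "3" plus a full stop.
    if token.endswith(".") and token[:-1].isdigit():
        return True
    # Extra hint (an assumption, not from the list above): if the next
    # token starts lowercase, the period is probably not sentence-final.
    if next_token is not None and next_token[:1].islower():
        return True
    return False

# period_belongs_to_token("3.", "sırada")  -> True (ordinal)
# period_belongs_to_token("M.Ö.")          -> True (abbreviation)
```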

For 5 and 6, see the TDK page on compound words that are written separately ("Ayrı Yazılan Birleşik Kelimeler"): http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=221:Ayri-Yazilan-Birlesik-Kelimeler&catid=50:yazm-kurallar&Itemid=132
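
A rough sketch of the post-tokenization merge for 5 and 6, assuming a phrase inventory built from something like the TDK list above; `MULTIWORD_PHRASES` and `merge_multiword_tokens` are made-up names:

```python
# Illustrative two-word inventory; longer phrases would need a longer
# lookahead window (or a trie) with the same idea.
MULTIWORD_PHRASES = {
    ("hafta", "sonu"): "hafta_sonu",
    ("İç", "Anadolu"): "İç_Anadolu",
}

def merge_multiword_tokens(tokens):
    """Replace known multiword phrases with a single merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD_PHRASES:
            merged.append(MULTIWORD_PHRASES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# merge_multiword_tokens(["her", "hafta", "sonu", "gelir"])
# -> ["her", "hafta_sonu", "gelir"]
```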
