This repository has been archived by the owner on Dec 24, 2024. It is now read-only.

Unsupervised statistical root extraction without a dictionary #25

Open

aliok opened this issue Dec 10, 2012 · 0 comments

aliok commented Dec 10, 2012

Brute-force root extractors already exist. However, they produce too many candidate roots, so it is better to do the extraction statistically.

This might be useful for finding roots that don't exist in the dictionary (e.g. local words) and for proper nouns.

Save the possible roots found in a large corpus (10M words) to a file
...

For proper noun recognition

check if the root has been used with an apostrophe in the corpus
or check if the word starts with an uppercase letter in the middle of a sentence in the corpus
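
The two proper-noun checks above could be sketched roughly as follows. This is only an illustration, not existing project code: the function name `proper_noun_evidence`, the token regex, and the naive sentence splitting on `.`, `!`, `?` are all assumptions, and the sample sentence is made up.

```python
import re
from collections import Counter

# Hypothetical tokenizer: a word, optionally followed by an apostrophe and a
# suffix, or a sentence-ending punctuation mark.
TOKEN = re.compile(r"([^\W\d_]+)(?:['\u2019]([^\W\d_]+))?|([.!?])")


def proper_noun_evidence(text):
    """Return two counters keyed by the part of the token before any apostrophe:
    how often it was followed by an apostrophe, and how often it was
    capitalised in the middle of a sentence."""
    apostrophe = Counter()
    capitalised = Counter()
    sentence_start = True
    for match in TOKEN.finditer(text):
        stem, suffix, punct = match.groups()
        if punct:
            # The next word opens a sentence, so its capital letter is not evidence.
            sentence_start = True
            continue
        if suffix is not None:
            apostrophe[stem] += 1
        if not sentence_start and stem[0].isupper():
            capitalised[stem] += 1
        sentence_start = False
    return apostrophe, capitalised


# Made-up sample text, for illustration only.
sample = "Dün Ankara'ya gittim. Orada Ayşe ile buluştum. Ayşe'nin evi güzel."
apostrophe, capitalised = proper_noun_evidence(sample)
```

In this sketch a word at the start of a sentence still counts toward the apostrophe check, but never toward the capitalisation check.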

For verb recognition, for example: for the non-dictionary word 'kıvışlıyordu', find the root 'kıvışlamak'

Check if the corpus contains other surface forms that also have "kıvışlamak" as a root candidate, such as 'kıvışladım', 'kıvışla', 'kıvışlarsa'
That would eliminate some of the candidates: 'kıvışlımak', 'kıvışlıyormak', 'kıvışlıyomak', etc.
However, it doesn't eliminate roots such as "kıvış+Noun", 'kıvmak', 'kıvımak', etc.
For those, check if there are other surface forms such as 'kıvışımı', 'kıvdım', 'kıvıyorum', etc.
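
A minimal sketch of this elimination idea, under stated assumptions: the `CANDIDATES` table is hand-written toy data standing in for the output of a brute-force extractor (it is not real analyzer output), and the function names `supporting_surfaces`, `surviving_roots` and `independent_evidence` are hypothetical.

```python
from collections import defaultdict

# Toy stand-in for brute-force extractor output (an assumption, written by hand):
# each surface form seen in the corpus maps to its candidate roots.
SHARED = {"kıvışlamak", "kıvış", "kıvmak", "kıvımak"}
CANDIDATES = {
    "kıvışlıyordu": SHARED | {"kıvışlımak", "kıvışlıyormak", "kıvışlıyomak"},
    "kıvışladım": set(SHARED),
    "kıvışla": set(SHARED),
    "kıvışlarsa": set(SHARED),
}


def supporting_surfaces(candidates):
    """Invert the table: root -> set of surface forms that list it as a candidate."""
    support = defaultdict(set)
    for surface, roots in candidates.items():
        for root in roots:
            support[root].add(surface)
    return support


def surviving_roots(word, candidates, min_support=1):
    """Keep the candidate roots of `word` that at least `min_support` OTHER surfaces share."""
    support = supporting_surfaces(candidates)
    return {
        root
        for root in candidates[word]
        if len(support[root] - {word}) >= min_support
    }


def independent_evidence(root, rival, candidates):
    """Surfaces that support `root` but cannot be explained by `rival`."""
    support = supporting_surfaces(candidates)
    return support[root] - support[rival]


survivors = surviving_roots("kıvışlıyordu", CANDIDATES)
# Without a surface like 'kıvışımı' there is nothing that backs 'kıvış' on its own.
no_evidence = independent_evidence("kıvış", "kıvışlamak", CANDIDATES)
with_noun_use = {**CANDIDATES, "kıvışımı": {"kıvış"}}
evidence = independent_evidence("kıvış", "kıvışlamak", with_noun_use)
```

With this toy table, the first pass drops 'kıvışlımak', 'kıvışlıyormak' and 'kıvışlıyomak' but keeps the shorter roots. The second function is one possible way to express the follow-up check: a shorter root is only trusted when some surface form supports it independently of the longer one.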

That would help a lot.

aliok mentioned this issue

Phrase recognition and database #28

Open
