Ngrams #13
Comments
Unfortunately, no. In its current state spello operates on unigrams and can't perform any n-gram normalisation. The latter part should be fairly easy: with some modifications you can get the internal vocabulary of the model and then exclude words with a simple filter. spello does not do this by default because it might end up deleting important context (domain-specific jargon, etc.).
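A minimal sketch of such a filter, assuming the vocabulary has already been pulled out of the trained model into a plain Python set. The function name `drop_out_of_vocabulary` and the sample vocabulary are hypothetical and not part of spello; how the vocabulary is extracted from the model is not shown here.

```python
def drop_out_of_vocabulary(text, vocabulary):
    """Keep only the tokens whose lowercase form appears in `vocabulary`."""
    kept = [token for token in text.split() if token.lower() in vocabulary]
    return " ".join(kept)


# Hypothetical vocabulary; in practice it would come from the trained model.
vocabulary = {"cheese", "hotdog", "icecream"}

# Stand-in for text that has already been through spelling correction.
corrected = "Cheese hotdog abcd"
print(drop_out_of_vocabulary(corrected, vocabulary))
```

Running the filter after correction rather than before means a misspelled word gets a chance to be fixed before it is judged against the vocabulary.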
Thank you for your response. Then maybe I can build an n-gram model on top of it. The major issue is that phrases like "panii puri" should be corrected to "pani puri" / "panipuri" depending on the context, but spello gives me "panini puri". I have trained spello on my dataset of around 20 lakh rows (3 lakh unique).
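As a rough illustration of layering phrase handling on top of unigram output, here is a sketch that uses a hand-maintained phrase map applied with a regular expression. This is not an n-gram model and not a spello feature; `NGRAM_MAP` and `normalise_ngrams` are hypothetical names, and the map entries are just the examples from this thread. It only merges known multi-word variants into one canonical form and does nothing about the "panii" to "panini" miscorrection itself.

```python
import re

# Hypothetical map from a multi-word variant to the canonical form to emit.
NGRAM_MAP = {
    "hot dog": "hotdog",
    "ice cream": "icecream",
    "pani puri": "panipuri",
}


def _phrase_regex(phrase):
    # Allow any run of whitespace between the words of a phrase.
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Longest phrases first so a longer entry wins over a shorter overlapping one.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(_phrase_regex(p) for p in sorted(NGRAM_MAP, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalise_ngrams(text):
    """Replace every known multi-word variant with its canonical form."""
    def replace(match):
        # Collapse whitespace and case so the match can be looked up in the map.
        key = " ".join(match.group(0).lower().split())
        return NGRAM_MAP[key]

    return _PATTERN.sub(replace, text)


print(normalise_ngrams("Cheese hot dog abcd"))
```

The map has to be maintained by hand, so this only scales to a modest list of known phrases; anything context-dependent would still need a real n-gram or language model.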
Interesting, because at the moment spello does not attempt to correct a word if it is in the vocabulary of the trained model. That is also one of the enhancements which would be welcome. If
"panii" cannot be in my vocabulary, because it only gets corrected to "panini" when it is out of vocabulary. But I am sure I don't have "panini puri" in my dataset :P Earlier, I was just using regexps to manually map the n-grams I need. Now I will need to look into building an n-gram model and see if it works well.
Does this model handle n-grams like "hot dog" = "hotdog" and "ice cream" = "icecream"?
I have these n-grams in my training data.
Also, what if I want to remove words that are not corrected and are not in my vocabulary either?
For example:
IN : "Cheese hot dog abcd"
OUT: "Cheese hotdog"