Ngrams #13

murtuzamdahod · 2020-09-30T22:14:17Z

Does this model take care of n-grams like "hot dog" = "hotdog", "ice cream" = "icecream"?

I have these ngrams in my training data

Also, what if I want to remove words that were not corrected and are not in my vocabulary either?
For example:

IN : "Cheese hot dog abcd"
OUT: "Cheese hotdog"

chiragjn · 2020-10-01T07:42:25Z

Unfortunately, no. In its current state, spello operates on unigrams and can't perform any n-gram normalisation.
We would gladly accept such an enhancement.

The latter part should be fairly easy: with some modifications you can get the internal vocabulary of the model and then exclude words with a simple filter. spello does not do this by default because it might end up deleting important context (domain-specific jargon, etc.).
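A minimal sketch of such a filter, for illustration only. It assumes you have already extracted the vocabulary as a plain Python set and have the corrected text as a string; `drop_out_of_vocabulary` and the sample `vocabulary` are hypothetical names, not part of spello's API.

```python
def drop_out_of_vocabulary(text, vocabulary):
    """Keep only the whitespace-separated tokens whose lowercase form is in `vocabulary`."""
    kept = [token for token in text.split() if token.lower() in vocabulary]
    return " ".join(kept)


# Hypothetical vocabulary; in practice this would come from the trained model.
vocabulary = {"cheese", "hot", "dog", "hotdog"}

print(drop_out_of_vocabulary("Cheese hot dog abcd", vocabulary))
```

Note that this drops every unknown token, so domain-specific terms missing from the vocabulary would be lost as well.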

murtuzamdahod · 2020-10-01T09:36:36Z

Thank you for your response. Then maybe I can build an ngram model on top of it.

The major issue is that words like "panii puri" should be corrected to "pani puri" / "panipuri" as per the context. Instead, spello gives me "panini puri". I have trained spello on my dataset of around 20 lakh (2 million) rows, of which 3 lakh (300,000) are unique.
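One way an n-gram layer on top could use context is to rerank candidate corrections by bigram counts from the training data. The sketch below is only an illustration under assumptions: the candidate list is supplied by hand (it is not something spello is known to expose), and `count_bigrams` and `pick_by_context` are hypothetical helper names.

```python
from collections import Counter


def count_bigrams(sentences):
    """Count adjacent lowercase word pairs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def pick_by_context(candidates, next_word, bigram_counts):
    """Return the candidate that most often precedes `next_word` in the corpus.

    On a tie, the candidate listed first wins.
    """
    return max(candidates, key=lambda c: bigram_counts[(c, next_word)])


# Toy corpus standing in for the real training rows.
corpus = ["pani puri", "pani puri chaat", "veg panini"]
bigram_counts = count_bigrams(corpus)

# Hand-written candidates for the misspelt word "panii" (an assumption).
print(pick_by_context(["panini", "pani"], "puri", bigram_counts))
```

With this toy corpus, "pani" is preferred before "puri" because the pair ("pani", "puri") occurs twice and ("panini", "puri") never occurs.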

chiragjn · 2020-10-01T09:53:56Z

Interesting.
Two questions:

Does panii occur in your train set? If yes, what is the count?
Does pani puri occur at least once in your train set?

I ask because, at the moment, spello does not attempt to correct a word if it is in the vocabulary of the trained model. Correcting such words is also one of the enhancements that would be welcome.
https://github.com/hellohaptik/spello#future-scope--limitations
Fixing grammatical mistakes and replacing legit words with contextually sensible words would definitely require more intelligence.

If panii does not occur in your training set, then that is surely a bug and we would like to fix it.
If possible, maybe you can share only the sentences that contain panii, pani, or puri so we can try to reproduce it.

murtuzamdahod · 2020-10-01T10:03:43Z

"panii" cannot be in my vocabulary, because it gets corrected to "panini" only if it is missing from the vocabulary. And I am sure I don't have "panini puri" in my dataset :P
So as per the context, "panii puri" should be "pani puri".

Earlier, I was just using regexes to manually map the n-grams I need. Now I will look into building an n-gram model and see if it works well.
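For reference, a manual regex mapping of this kind could look roughly like the sketch below. The mapping entries come from the examples in this thread; `NGRAM_MAP`, `build_pattern`, and `normalise_ngrams` are hypothetical names, and the keys are assumed to be lowercase with single spaces.

```python
import re

# Hand-maintained mapping from multi-word phrases to their joined form.
NGRAM_MAP = {"hot dog": "hotdog", "ice cream": "icecream"}


def build_pattern(mapping):
    """Compile one case-insensitive pattern matching any phrase in `mapping`."""
    # Longest phrases first, so a longer phrase wins over a shorter prefix.
    phrases = sorted(mapping, key=len, reverse=True)
    parts = [r"\s+".join(re.escape(word) for word in phrase.split()) for phrase in phrases]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


def normalise_ngrams(text, mapping=NGRAM_MAP):
    """Replace every mapped phrase in `text` with its joined form."""
    pattern = build_pattern(mapping)

    def replace(match):
        # Lowercase and collapse whitespace so the match lines up with a mapping key.
        key = " ".join(match.group(0).lower().split())
        return mapping[key]

    return pattern.sub(replace, text)


print(normalise_ngrams("Cheese hot dog"))
```

This only covers phrases listed by hand and does not handle misspellings inside a phrase, which is where a learned n-gram model would come in.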

chiragjn added the enhancement label · Oct 1, 2020
